Abstract



We show that for several variations of partially observable Markov decision processes,
polynomial-time algorithms for finding control policies are unlikely to have, or simply do
not have, guarantees of finding policies within a constant factor or a constant summand of
optimal. Here "unlikely" means "unless some complexity classes collapse," where the collapses
considered are P = NP, P = PSPACE, or P = EXP. Until or unless these collapses are shown
to hold, any control-policy designer must choose between such performance guarantees and
efficient computation.

1. Introduction 

Life is uncertain; real-world applications of artificial intelligence contain many uncertain- 
ties. In this work, we show that uncertainty breeds uncertainty: In a controlled stochastic 
system with uncertainty (as modeled by a partially observable Markov decision process, for 
example), plans can be obtained efficiently or with quality guarantees, but rarely both. 

Planning over stochastic domains with uncertainty is hard (in some cases PSPACE-hard
or even undecidable, see Papadimitriou & Tsitsiklis, 1987; Madani, Hanks, & Condon,
1999). Given that it is hard to find an optimal plan or policy, it is natural to try to find 
one that is "good enough". In the best of all possible worlds, this would mean having an 
algorithm that is guaranteed to be fast and to produce a policy that is reasonably close 
to the optimal policy. Unfortunately, we show here that such an algorithm is unlikely or, 
in some cases, impossible. The implication for algorithm development is that developers 
should not waste time working toward both guarantees at once. 

The particular mathematical models we concentrate on in this paper are Markov de- 
cision processes (MDPs) and partially observable Markov decision processes (POMDPs). 
We consider both the straightforward representations of MDPs and POMDPs, and succinct 
representations, since the complexity of finding policies is measured not in terms of the size 
of the system, but in terms of the size of the representation of the system. 

There has been a significant body of work on heuristics for succinctly represented MDPs 
(see Boutilier, Dean, & Hanks, 1999; Blythe, 1999 for surveys). Some of this work grows 
out of the engineering tradition (see, for example, Tsitsiklis and Van Roy's (1996) article on
feature-based methods) which depends on empirical evidence to evaluate algorithms. While 
there are obvious drawbacks to this approach, our work argues that this may be the most 
appropriate way to verify the quality of an approximation algorithm, at least if one wants 
to do so in reasonable time. 

The same problems that plague approximation algorithms for uncompressed representa- 
tions carry over to the succinct representations, and the compression introduces additional 
complexity. For example, if there is no computable approximation of the optimal policy in 
the uncompressed case, then compression will not change this. However, it is easy to find 
the optimal policy for an infinite-horizon fully observable MDP (Bellman, 1957), yet EXP- 
hard (provably harder than polynomial time) to find approximately optimal policies (in 
time measured in the size of the input) if the input is represented succinctly (see Section 5). 

Note that there are two interpretations to finding an approximation: finding a policy 
with value close to that of the optimal policy, or simply calculating a value that is close to 
the optimal value. If we can do the former and can evaluate policies, then we can certainly 
do the latter. Therefore, we sometimes show that the latter cannot be done, or cannot be 
done in time polynomial in the size of the input (unless something unlikely is true). 

The complexity class PSPACE consists of those languages recognizable by a Turing
machine that uses only p(n) memory for some polynomial p, where n is the size of the
input. Because each time step uses at most one unit of memory, P ⊆ PSPACE, though
we do not know whether that is a proper inclusion or an equality. Because, given a limit
on the amount of memory used, there are only exponentially many configurations of that
memory possible with a fixed finite alphabet, PSPACE ⊆ EXP. It is not known whether
this is a proper inclusion or an equality either, although it is known that P ≠ EXP. Thus,
a PSPACE-hardness result says that the problem is apparently not tractable, but an
EXP-hardness result says that the problem is certainly not tractable.

Researchers also consider problems that are P-complete (under logspace or other highly 
restricted reductions). For example, the policy existence problem for infinite-horizon MDPs 
is P-complete (Papadimitriou & Tsitsiklis, 1987). This is useful information, because it is 
generally thought that P-complete problems are not susceptible to significant speed-up via 
parallelization. (For a more thorough discussion of P-completeness, see Greenlaw, Hoover, 
& Ruzzo, 1995.) 

We also know that NP ⊆ PSPACE, so P = PSPACE would imply P = NP. Thus,
any argument or belief that P ≠ NP implies that P ≠ PSPACE. (For elaborations of this
complexity theory primer, see any complexity theory text, such as Papadimitriou, 1994.)

In this paper, we show that there is a necessary trade-off between running time guar- 
antees and performance guarantees for any general POMDP approximation algorithm — 
unless P = NP or P = PSPACE. (Table 1 gives an overview of our results.) Note that 
(assuming P ≠ NP or P ≠ PSPACE) this tells us that there is no algorithm that runs in
time polynomial in the size of the representation of the POMDP that finds a policy that 
is close to optimal for every instance. It does not say that fast algorithms will produce 
far-from-optimal values for all POMDPs; there are many instances where the algorithms 
already in use or being developed will be both fast and close. We simply can't guarantee 
that the algorithms will always find a close-to-optimal policy quickly. 
[Table 1, partially recovered. Surviving rows: time-dependent policy, representation "-",
k-additive approximation: P-hard. Stationary policy, 2TBN representation, horizon n,
k-additive approximation: EXP-hard. One surviving entry cites (Madani et al., 1999).]

Table 1: Hardness for partially and fully observable MDPs



1.1 Heuristics and Approximations 

The state of the art with respect to POMDP policy-finding algorithms is that there are 
three types of algorithms in use or under investigation: exact algorithms, approximations, 
and heuristics. Exact algorithms attempt to find exact solutions. In the finite-horizon 
cases, they run in worst-case time at least exponential in the size of the POMDP and 
the horizon (assuming a straightforward representation of the POMDP). In the infinite-horizon
case, they do not necessarily halt, but can be stopped when the policy is within ε of
optimal (a checkable condition). Approximation algorithms construct approximations to
what the exact algorithms find. (Examples of this include grid-based methods, Hauskrecht, 
1997; Lovejoy, 1991; White, 1991.) Heuristics come in two flavors: those that construct or 
find actual policies that can be evaluated, and those that specify a means of choosing an 
action (for example, "most likely state"), which do not yield policies that can be evaluated 
using the standard, linear algebra-based methods. 

The best current exact algorithm is incremental pruning (IP) with point-based improvement
(Zhang, Lee, & Zhang, 1999). Littman's analysis of the witness algorithm (Littman,
Dean, & Kaelbling, 1995; Cassandra, Kaelbling, & Littman, 1995) still applies: This algorithm
requires exponential time in the worst case. The underlying theory of these algorithms
(Witness, IP, etc.) for infinite-horizon cases depends on Bellman's and Sondik's work on
value iteration for MDPs and POMDPs (Bellman, 1957; Sondik, 1971; Smallwood & Sondik,
1973).

The best known family of approximation algorithms is known as grid methods. The 
basic idea is to use a finite grid of points in the belief space (the space of all probability 
distributions over the states of the POMDP — this is the underlying space for the algo- 
rithms mentioned above) to define a policy. Once the grid points are chosen, all of these 
algorithms use value iteration on the points to obtain a policy for those belief states, then 
interpolate to the whole belief space. The difference in the algorithms lies in the choice of 
grid points. (An excellent survey appears in Hauskrecht, 1997.) These algorithms are called 
approximation algorithms because they approximate the process of value iteration, which
the exact algorithms carry out exactly.

Heuristics that do not yield easily evaluated policies are surveyed in (Cassandra, 1998). 
These are often very easy to implement, and include techniques such as "most likely state" 
(choosing a state with the highest probability from the belief state, and acting as if the 
system were fully observable), and minimum entropy (choosing the action that gives the 
most information about the current state). Others depend on "voting," where several 
heuristics or options are combined. 
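
To make the belief-state idea concrete, the following is a minimal sketch of a belief update and of the "most likely state" heuristic. It assumes deterministic observations (each state s emits a single observation o[s]) and a dictionary encoding of transitions; the names `belief_update`, `most_likely_state_action`, `t`, `o`, and `mdp_policy` are illustrative and are not taken from any of the cited algorithms.

```python
def belief_update(belief, action, observation, t, o):
    """Bayes update of a belief state (dict: state -> probability).

    Assumed encoding: t[(s, a)] is a dict {s2: probability} and o[s] is the
    single observation emitted in state s.
    """
    unnormalized = {}
    for s, p in belief.items():
        for s2, q in t.get((s, action), {}).items():
            if o[s2] == observation:  # keep only states consistent with what was seen
                unnormalized[s2] = unnormalized.get(s2, 0.0) + p * q
    z = sum(unnormalized.values())
    if z == 0:
        raise ValueError("observation has probability zero under this belief")
    return {s: p / z for s, p in unnormalized.items()}


def most_likely_state_action(belief, mdp_policy):
    """Act as if the most probable state were the true state."""
    return mdp_policy[max(belief, key=belief.get)]


# Toy system: from "a", action "go" leads to a, b, or c; b and c look alike.
t = {("a", "go"): {"a": 0.5, "b": 0.125, "c": 0.375}}
o = {"a": "dark", "b": "light", "c": "light"}
b1 = belief_update({"a": 1.0}, "go", "light", t, o)
chosen = most_likely_state_action(b1, {"a": "go", "b": "left", "c": "right"})
```

In this toy run the updated belief puts weight 0.25 on b and 0.75 on c, so the heuristic acts as if the system were in c. Nothing in the sketch bounds how far such a choice can be from optimal.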

There are heuristics based on finite histories or other uses of finite amounts of memory
within the algorithm (Sondik, 1971; Platzman, 1977; Hansen, 1998a, 1998b; Lusena, Li,
Sittinger, Wells, & Goldsmith, 1999; Meuleau, Kim, Kaelbling, & Cassandra, 1999; Meuleau,
Peshkin, Kim, & Kaelbling, 1999; Peshkin, Meuleau, & Kaelbling, 1999; Hansen & Feng,
2000; Kim, Dean, & Meuleau, 2000). None of these comes with proofs of closeness, except
for some of Hansen's work. For the rest, the trade-off has been made between fast searching 
through policy space and guarantees. 

1.2 Structure of This Paper 

In Section 2, we give formal definitions of MDPs and POMDPs and policies; two-phase temporal
Bayes nets (2TBNs) are defined in Section 5. In Section 3, we define ε-approximations
and additive approximations, and show a relationship between the two types of approximability
for MDPs and POMDPs.

We separate the complexity results for finite-horizon policy approximation from those 
for infinite-horizon policies. Section 4 contains nonapproximability results for finite-horizon 
POMDP policies; Section 6 contains nonapproximability for infinite-horizon POMDP poli- 
cies. Although it is relatively easy to find optimal MDP policies, we consider approximating 
MDP policies in Section 5, since the succinctly represented case, at least, is provably hard 
to approximate. 

Some of the more technical proofs are included in appendices in order to make the body 
of the paper more readable. However, some proofs from other papers are sketched in the 
body of the paper in order to motivate both the results and the proofs newly presented 
here. 

2. Definitions 

Note that MDPs are in fact special cases of POMDPs. The complexity of finding and 
approximating optimal policies depends on the observability of the system, so our results 
are segregated by observability. However, one set of definitions suffices. 

2.1 Partially Observable Markov Decision Processes 

A partially observable Markov decision process (POMDP) describes a controlled stochastic 
system by its states and the consequences of actions on the system. It is denoted as a tuple 
$M = (S, s_0, A, O, t, o, r)$, where

• S, A and O are finite sets of states, actions and observations;

• $s_0 \in S$ is the initial state;

• $t : S \times A \times S \to [0, 1]$ is the state transition function, where $t(s, a, s')$ is the probability
that state s' is reached from state s on action a (for every $s \in S$ and $a \in A$: either
$\sum_{s' \in S} t(s, a, s') = 1$, if action a can be applied on state s, or $\sum_{s' \in S} t(s, a, s') = 0$);

• $o : S \to O$ is the observation function, where $o(s)$ is the observation made in state s;

• $r : S \times A \to \mathbb{Q}$ is the reward function, where $r(s, a)$ is the reward gained by taking
action a in state s.

If states and observations are identical, i.e. O = S and o is the identity function (or 
a bijection), then the MDP is called fully observable. Another special case is unobservable 
MDPs, where the set of observations contains only one element, i.e. in every state the same
observation is made, and therefore the observation function is constant. 

Normally, MDPs are represented by S × S tables, one for each action. However, we will
also discuss more succinct representations: in particular, two-phase temporal Bayes nets
(2TBNs). These will be defined in Section 5.
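
As an illustration of the flat representation, here is a minimal sketch of the tuple defined above as a Python record, together with checks for the two special cases (fully observable and unobservable). The class and function names, and the choice of dictionaries for t, o, and r, are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class POMDP:
    states: tuple
    s0: object
    actions: tuple
    observations: tuple
    t: dict  # (s, a) -> {s2: probability}; a missing key means a is not applicable in s
    o: dict  # s -> observation
    r: dict  # (s, a) -> reward


def is_well_formed(m):
    """Each row of t sums to 1 (action applicable) or to 0 (not applicable)."""
    for s in m.states:
        for a in m.actions:
            total = sum(m.t.get((s, a), {}).values(), Fraction(0))
            if total not in (0, 1):
                return False
    return m.s0 in m.states and all(m.o[s] in m.observations for s in m.states)


def is_fully_observable(m):
    """The observation function is a bijection between states and observations."""
    seen = {m.o[s] for s in m.states}
    return len(seen) == len(m.states) == len(m.observations)


def is_unobservable(m):
    """There is only one observation, so o is constant."""
    return len(m.observations) == 1


half = Fraction(1, 2)
m = POMDP(
    states=("s0", "s1"), s0="s0", actions=("a", "b"), observations=("x",),
    t={("s0", "a"): {"s0": half, "s1": half}, ("s1", "a"): {"s1": Fraction(1)}},
    o={"s0": "x", "s1": "x"},
    r={("s1", "a"): 1},
)
```

Exact fractions are used so that the "sums to 1 or to 0" test is not affected by floating-point rounding.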

2.2 Policies and Performances 

A policy describes how to act depending on observations. We distinguish three types of 
policies. 

• A stationary policy $\pi_s$ (for M) is a function $\pi_s : O \to A$, mapping each observation
to an action.

• A time-dependent policy $\pi_t$ is a function $\pi_t : O \times \mathbb{N} \to A$, mapping each pair
(observation, time) to an action.

• A history-dependent policy $\pi_h$ is a function $\pi_h : O^* \to A$, mapping each finite sequence
of observations to an action.

Notice that, for an unobservable MDP, a history-dependent policy is equivalent to a 
time-dependent one. 
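
The three policy types can be put behind one interface, a function from the observation history so far to an action. The sketch below does this and checks the remark about unobservable MDPs on a small example; the helper names are hypothetical, and counting time steps from 0 is an assumption of the sketch.

```python
def stationary(table):
    """table: observation -> action. Only the latest observation is used."""
    return lambda history: table[history[-1]]


def time_dependent(table):
    """table: (observation, step) -> action, with steps counted from 0."""
    return lambda history: table[(history[-1], len(history) - 1)]


def history_dependent(fn):
    """fn: tuple of all observations so far -> action."""
    return lambda history: fn(tuple(history))


# Unobservable case: the only observation is "x", so a history carries no
# information beyond its length, and the two policies below coincide.
pi_t = time_dependent({("x", 0): "a", ("x", 1): "b", ("x", 2): "a"})
pi_h = history_dependent(lambda h: "a" if len(h) % 2 == 1 else "b")

history = []
actions_t, actions_h = [], []
for _ in range(3):
    history.append("x")
    actions_t.append(pi_t(history))
    actions_h.append(pi_h(history))
```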

Recent algorithmic development has included consideration of finite memory policies
as well (Hansen, 1998b, 1998a; Lusena, Li, Sittinger, Wells, & Goldsmith, 1999; Meuleau,
Kim, Kaelbling, & Cassandra, 1999; Meuleau, Peshkin, Kim, & Kaelbling, 1999; Peshkin,
Meuleau, & Kaelbling, 1999; Hansen & Feng, 2000; Kim, Dean, & Meuleau, 2000). These
are policies that are allowed some finite amount of memory; sufficient allowances would 
enable such a policy to simulate a full history-dependent policy over a finite horizon, or 
perhaps a time-dependent policy, or to use less memory more judiciously. One variant 
of finite memory policies, which we call free finite memory policies, fixes the amount of 
memory a priori. 

More formally, a free finite memory policy with the finite set $\mathcal{M}$ of memory states
for POMDP $M = (S, s_0, A, O, t, o, r)$ is a function $\pi_f : O \times \mathcal{M} \to A \times \mathcal{M}$, mapping each
(observation, memory state) pair to a pair (action, memory state). The set of memory
states $\mathcal{M}$ can be seen as a finite "scratch" memory.

1. Note that making observations probabilistically does not add any power to MDPs. Any probabilistically
observable MDP can be turned into one with deterministic observations with only a polynomial increase
in its size.

Free finite memory policies can also simulate stationary policies; all hardness results for
stationary policies apply to free finite memory policies as well. Because one can consider
a free finite memory policy to be a stationary policy over the state space $S \times \mathcal{M}$, all
upper bounds (complexity class membership results) for stationary policies hold for free 
finite memory policies as well. The advantages of free finite memory policies appear in 
the constants of the algorithms, and in special, probably large, subclasses of POMDPs, 
where a finite amount of memory suffices for an optimal policy. The maze instances such as 
McCallum's maze (McCallum, 1993; Littman, 1994) are such examples: McCallum's maze 
requires only 1 bit of memory to find an optimal policy. 
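
A free finite memory policy can be sketched as a lookup table from (observation, memory state) to (action, next memory state). The example below uses one bit of memory to alternate between two actions under a constant observation, which no stationary policy can do. The function name, the table encoding, and the toy actions are assumptions of this sketch and are unrelated to McCallum's maze itself.

```python
def run_finite_memory(pi_f, observations, initial_memory):
    """Apply a finite memory policy to a sequence of observations.

    pi_f maps (observation, memory state) to (action, next memory state).
    Returns the list of actions chosen.
    """
    memory = initial_memory
    actions = []
    for obs in observations:
        action, memory = pi_f[(obs, memory)]
        actions.append(action)
    return actions


# One bit of memory: remember which action was taken last.
pi_f = {
    ("x", 0): ("left", 1),
    ("x", 1): ("right", 0),
}
chosen = run_finite_memory(pi_f, ["x"] * 4, initial_memory=0)
```

Viewing each (state, memory state) pair as a state of a larger system turns `pi_f` into an ordinary stationary policy, which is the observation behind the upper bounds mentioned above.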

Let $M = (S, s_0, A, O, t, o, r)$ be a POMDP.

A trajectory of length m for M is a sequence of states $\sigma = \sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_m$ ($m \geq 0$,
$\sigma_i \in S$) which starts with the initial state of M, i.e. $\sigma_0 = s_0$. We use $T_k(s)$ to denote the
set of length-k trajectories which end in state s.

The expected reward obtained in state s after exactly k steps under policy π is the
reward obtained in s by taking the action specified by π, weighted by the probability that
s is actually reached after k steps,

• $r(s, k, \pi) = r(s, \pi(o(s))) \cdot \sum_{(\sigma_0, \ldots, \sigma_k) \in T_k(s)} \prod_{i=1}^{k} t(\sigma_{i-1}, \pi(o(\sigma_{i-1})), \sigma_i)$, if π is a stationary policy,

• $r(s, k, \pi) = r(s, \pi(o(s), k)) \cdot \sum_{(\sigma_0, \ldots, \sigma_k) \in T_k(s)} \prod_{i=1}^{k} t(\sigma_{i-1}, \pi(o(\sigma_{i-1}), i-1), \sigma_i)$, if π is a time-dependent policy, and

• $r(s, k, \pi) = \sum_{(\sigma_0, \ldots, \sigma_k) \in T_k(s)} r(s, \pi(o(\sigma_0) \cdots o(\sigma_k))) \cdot \prod_{i=1}^{k} t(\sigma_{i-1}, \pi(o(\sigma_0) \cdots o(\sigma_{i-1})), \sigma_i)$, if π is a history-dependent policy.

A POMDP may behave differently under optimal policies for each type of policy. The 
quality of a policy is determined by its performance, i.e. by the expected rewards accrued 
by it. We distinguish between different performance metrics for POMDPs that run for a 
finite number of steps and those that run indefinitely. 

• The finite-horizon performance of a policy π for POMDP M is the expected sum of
rewards received during the first |M| steps by following the policy π, i.e.,
$\mathrm{perf}_f(M, \pi) = \sum_{i=0}^{|M|-1} \sum_{s \in S} r(s, i, \pi)$. (Other work assumes that the horizon is poly(|M|), instead of
|M|. This does not change the complexity of any of our problems.)

• The infinite-horizon total discounted performance gives rewards obtained earlier in
the process a higher weight than those obtained later. For 0 < β < 1, the total
β-discounted reward is defined as $\mathrm{perf}_{td}(M, \pi) = \sum_{i=0}^{\infty} \sum_{s \in S} \beta^i \cdot r(s, i, \pi)$.

• The infinite-horizon average performance is the limit of all rewards obtained within n
steps divided by n, for n going to infinity (see footnote 2): $\mathrm{perf}_{av}(M, \pi) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \sum_{s \in S} r(s, i, \pi)$.

2. If this limit is not defined, the performance is defined as a liminf. 
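
For a stationary policy, the sum over trajectories in the definition of r(s, k, π) is the probability of being in state s after k steps, so the performance can be computed by pushing the state distribution forward one step at a time. The sketch below does this under stated assumptions: dictionaries encode t, o, and r, steps are counted from 0, and a discount factor `beta` is applied to a truncated horizon. The function name and the toy system are hypothetical.

```python
from fractions import Fraction


def performance(t, o, r, s0, policy, horizon, beta=Fraction(1)):
    """Expected (optionally discounted) reward over steps 0..horizon-1.

    t[(s, a)] -> {s2: probability}, o[s] -> observation, r[(s, a)] -> reward,
    policy[observation] -> action (a stationary policy).
    """
    dist = {s0: Fraction(1)}  # distribution over states after k steps
    total = Fraction(0)
    for k in range(horizon):
        nxt = {}
        for s, p in dist.items():
            a = policy[o[s]]
            total += beta ** k * p * r.get((s, a), 0)
            for s2, q in t.get((s, a), {}).items():
                nxt[s2] = nxt.get(s2, Fraction(0)) + p * q
        dist = nxt
    return total


half = Fraction(1, 2)
t = {("s0", "a"): {"s0": half, "s1": half}, ("s1", "a"): {"s1": Fraction(1)}}
o = {"s0": "x", "s1": "x"}
r = {("s1", "a"): 1}
undiscounted = performance(t, o, r, "s0", {"x": "a"}, horizon=3)
discounted = performance(t, o, r, "s0", {"x": "a"}, horizon=3, beta=half)
```

This is the kind of polynomial-time policy evaluation that the later NP membership arguments rely on; it evaluates a given policy and says nothing about finding a good one.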
Let perf be any of these performance metrics, and let α be any policy type, either
stationary, time-dependent, or history-dependent. The α-value $\mathrm{val}_\alpha(M)$ of M (under the
metric chosen) is the maximal performance of any policy π of type α for M, i.e.
$\mathrm{val}_\alpha(M) = \max_{\pi \in \Pi_\alpha} \mathrm{perf}(M, \pi)$, where $\Pi_\alpha$ is the set of all α policies.

For simplicity, we assume that the size |M| of a POMDP M is determined by the size 
n of its state space. We assume that there are no more actions than states, and that each 
state transition probability is given as a binary fraction with n bits and each reward is an 
integer of at most n bits. This is no real restriction, since adding unreachable "dummy" 
states allows one to use more bits for transition probabilities and rewards. Also, it is 
straightforward to transform a POMDP M with non-integer rewards to M' with integer 
rewards such that $\mathrm{val}_\alpha(M, k) = c \cdot \mathrm{val}_\alpha(M', k)$ for some constant c depending only on (M, k)
and not on $\mathrm{val}_\alpha(M, k)$.

We consider problem instances that are represented in a straightforward way. A PO- 
MDP with n states is represented by a set of n x n tables for the transition function (one 
table for each action) and a similar table for the reward function and for the observation 
function. We assume that the number of actions and the number of bits needed to store each 
transition probability or reward does not exceed n, so such a representation requires $O(n^4)$
bits. (This can be modified to allow polynomially many bits without changing the complexity results.) In
the same way, stationary policies can be encoded as lists with n entries, and time-dependent 
policies for horizon n as n x n tables. 

For each type of POMDP, each type of policy, and each type of performance metric the 
value problem is, 

given a POMDP, a performance metric (finite-horizon, total discounted, or average
performance), and a policy type (stationary, time-dependent, or history-dependent),

calculate the value of the best policy of the specified type under the given performance 
metric. 

The policy existence problem is, 

given a POMDP, a performance metric, and a policy type, 

decide whether the value of the best policy of the specified type under the given
performance metric is greater than 0.

3. Approximability 

In previous work (Papadimitriou & Tsitsiklis, 1986, 1987; Mundhenk, Goldsmith, & Allender,
1997; Mundhenk, Goldsmith, Lusena, & Allender, 2000; Madani et al., 1999), it was
shown that the policy existence problem is computationally intractable for most variations 
of POMDPs, or even undecidable for some infinite-horizon cases. For example, we showed 
that the stationary policy existence problems for POMDPs with or without negative re- 
wards are NP-complete. Computing an optimal policy is at least as hard as deciding the 
existence problem. Instead of asking for an optimal policy, we might wish to compute a 
policy that is guaranteed to have a value that is at least a large fraction of the optimal 
value. 
A polynomial-time algorithm computing such a nearly optimal policy is called an
ε-approximation (for 0 ≤ ε < 1), where ε indicates the quality of the approximation in the
following way. Let A be a polynomial-time algorithm which for every POMDP M computes
an α-policy A(M). Notice that $\mathrm{perf}(M, A(M)) \leq \mathrm{val}_\alpha(M)$ for every M. The algorithm A
is called an ε-approximation if for every POMDP M,

$\frac{\mathrm{val}_\alpha(M) - \mathrm{perf}(M, A(M))}{\mathrm{val}_\alpha(M)} \leq \varepsilon.$

(See, e.g., Papadimitriou, 1994 for more detailed definitions.) Approximability distinguishes
NP-complete problems: There are problems which are ε-approximable for all ε, for
certain ε, or for no ε (unless P = NP). Note that this definition of ε-approximation requires
that $\mathrm{val}_\alpha(M) > 0$. If a policy with positive performance exists, then every approximation
algorithm yields such a policy, because a policy with performance 0 or smaller cannot
approximate a policy with positive performance. Hence, any approximation straightforwardly
solves the decision problem.

An approximation scheme yields an ε-approximation for arbitrary ε > 0. If there is a
polynomial-time algorithm that on input POMDP M and ε outputs an ε-approximation of
the value, in time polynomial in the size of M, then we say the problem has a Polynomial-
Time Approximation Scheme (PTAS). If the algorithm runs in time polynomial in the size
of M and 1/ε, the scheme is a Fully Polynomial-Time Approximation Scheme (FPTAS). All
of the PTASs constructed here are FPTASs; we state the theorems in terms of PTASs
because that gives stronger results in some cases, and because we do not explicitly analyze
the complexity in terms of 1/ε.

If there is a polynomial-time algorithm that outputs an approximation, v, to the value
μ of M (μ = $\mathrm{val}_\alpha(M)$) with μ ≥ v ≥ μ − k, then we say that the problem has a k-additive
approximation algorithm.
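
The two guarantees can be stated as simple predicates on a returned value, which makes the difference between them easy to see on numbers. This is only an illustrative sketch; the function names are hypothetical.

```python
from fractions import Fraction


def is_eps_approximation(achieved, optimum, eps):
    """Relative guarantee: (optimum - achieved) / optimum <= eps, for optimum > 0."""
    if optimum <= 0:
        raise ValueError("the relative guarantee needs a positive optimum")
    return achieved <= optimum and Fraction(optimum - achieved) / optimum <= eps


def is_k_additive(v, mu, k):
    """Additive guarantee: mu - k <= v <= mu."""
    return mu - k <= v <= mu


# The same additive error k = 5 is a tiny relative error when the optimum is
# large, and a large relative error when the optimum is small.
large = (is_k_additive(995, 1000, 5), is_eps_approximation(995, 1000, Fraction(1, 100)))
small = (is_k_additive(5, 10, 5), is_eps_approximation(5, 10, Fraction(1, 100)))
```

Because rewards can be multiplied by a constant, an instance can always be moved from the second situation to the first.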

In the context of POMDPs, existence of a k-additive approximation algorithm and a
PTAS are often equivalent. This might seem surprising to readers who are more familiar 
with reward criteria that have fixed upper and lower bounds on the performance of a 
solution, for example, the probability of reaching a goal state. In these cases, the fixed 
bounds on performance will give different results. However, we are addressing the case 
where there is no a priori upper bound on the performance of policies, even though there 
are computable upper bounds on the performance of a policy for each instance. 

Theorem 3.1 For POMDPs with non-negative rewards and flat representations under 
finite-horizon total or total discounted, and infinite-horizon total discounted reward met- 
rics, if there exists a k-additive approximation, then we can determine in polynomial time 
whether there is a policy with performance greater than 0. 

Proof The theorem follows from two facts: (1) given a POMDP M with value μ, we can
construct another POMDP θM with value θ · μ just by multiplying all rewards in the former
POMDP by θ; (2) under these reward metrics we can find a lower bound on μ if it is not 0.

The computation of the lower bound, δ, on the value of μ depends on the reward metric.
Because there are no negative rewards, in order for the expected reward to be positive in the
finite-horizon case, an action with positive reward must be taken with nonzero probability
by the last step. Consider only reachable states of the POMDP, and let ν be the lowest
nonzero transition probability to one of these states, h the horizon, and ζ the smallest
nonzero reward, and set $\delta = \nu^h \zeta$. Then $\nu^h$ is a lower bound on the probability of actually
reaching any particular state after h steps (if this probability is nonzero), in particular a
state with reward ζ. If the reward metric is discounted, then let $\delta = (\gamma\nu)^h \zeta$, where γ ∈ (0, 1]
is the discount factor.

Now consider the infinite-horizon under a stationary policy. This induces a Markov 
process, and the policy has nonzero reward if there is a nonzero probability path to a 
reward node, i.e., a state from which there is a positive-reward action possible. This is true 
if and only if there is a nonzero-probability simple path (visiting each node at most once) 
to a reward node. Such a path accrues reward at least $\delta = (\gamma\nu)^{|S|} \zeta$ for stationary policies.

Since stationary policies have values bounded by the time-dependent and history-depend- 
ent values for infinite-horizon POMDPs, this lower bound for the stationary value of the 
POMDP is also a lower bound for other policies. 

Finally, note that if the value of a POMDP with non-negative rewards is 0, then a
k-additive approximation cannot return a positive value. To determine whether there is
a policy with reward greater than 0 for a given POMDP, compute δ and then set θ such
that θδ − k > 0, i.e., θ > k/δ, and run the k-additive approximation algorithm on θM. The
POMDP has positive value if and only if the approximation returns a positive value. □
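
The decision procedure in this proof can be sketched in a few lines. The additive approximation algorithm is represented by a stand-in function that, given the scaling factor θ, returns the worst value the guarantee allows for θM; that stand-in, the function names, and the numbers are assumptions for illustration, not an actual approximation algorithm.

```python
from fractions import Fraction
import math


def lower_bound_delta(nu, zeta, h, gamma=1):
    """delta = (gamma * nu)^h * zeta; gamma = 1 gives the undiscounted bound."""
    return (Fraction(gamma) * Fraction(nu)) ** h * zeta


def has_positive_value(additive_on_scaled, k, delta):
    """Decide whether the value is positive.

    additive_on_scaled(theta) must return some v with
    theta * mu - k <= v <= theta * mu, where mu is the value of the POMDP.
    """
    theta = math.floor(Fraction(k) / delta) + 1  # smallest integer with theta > k / delta
    return additive_on_scaled(theta) > 0


def worst_case_oracle(mu, k):
    """Stand-in for the algorithm: always returns the lowest value allowed."""
    return lambda theta: theta * mu - k


k = 5
delta = lower_bound_delta(Fraction(1, 2), zeta=1, h=3)  # (1/2)^3 * 1
positive = has_positive_value(worst_case_oracle(Fraction(1, 8), k), k, delta)
zero = has_positive_value(worst_case_oracle(Fraction(0), k), k, delta)
```

Even in the worst case the scaled value θμ exceeds k whenever μ ≥ δ, so the returned value is positive exactly when μ is.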

Note that this does not contradict the undecidability result of Madani et al. (1999). The 
problem that they proved undecidable is whether a POMDP with nonpositive rewards has 
a history-dependent or time-dependent value of 0. We're asking whether it has value > 0
in the non-negative reward case; answering this question (even if we multiply the rewards 
by —1) does not answer their question. 

Corollary 3.2 For POMDPs with flat representations and non-negative rewards, the PO- 
MDP value problem under finite-horizon or infinite-horizon total discounted reward is k- 
additive approximable if and only if there exists a PTAS for that POMDP value problem. 

Note that the corollary depends only on Facts (1) and (2) from the proof of Theorem 3.1. 
Thus, any optimization problem with those properties will have a k-additive approximation
if and only if it has a PTAS. 

Proof Let μ = $\mathrm{val}_\alpha(M)$, and let A be a polynomial-time k-additive approximation algorithm.
First, the PTAS computes δ as in Theorem 3.1 and checks whether μ = 0. If so, it
outputs 0. Otherwise, given ε, it chooses θ such that $\theta \geq \frac{k}{\varepsilon\delta} \geq \frac{k}{\varepsilon\mu}$, and thus $\theta\mu - k \geq (1 - \varepsilon)\theta\mu$
holds. Let v = A(θM). (Note that A(M) is the approximation to the value of M found by
running algorithm A.) Then v is an ε-approximation to θμ, so v/θ is an ε-approximation to
μ.

Suppose, instead, that we have a PTAS for optimal policies for this problem. Let
A(M, ε) be an algorithm that demonstrates this. Let μ = $\mathrm{val}_\alpha(M)$, and A(M, 0.5) = v.
Thus μ ≥ v ≥ μ/2. If μ = 0 then v = 0 and we can stop. Else we choose an ε such that
(1 − ε)μ ≥ μ − k, giving ε ≤ k/μ. Since k/(2v) ≤ k/μ, and k/(2v) is polynomial size and is
polynomial-time computable in |M| (since v is the output of A(M, 0.5)), we can choose ε ≤ k/(2v) and
run A(M, ε). This gives a k-additive approximation. □
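
Both directions of this proof amount to a choice of parameter, which the following sketch spells out. The two algorithms are replaced by stand-ins that return the worst value their guarantee permits; those stand-ins, all function names, and capping ε at 1/2 so that it stays below 1 are assumptions made for illustration.

```python
from fractions import Fraction
import math


def ptas_from_additive(additive_on_scaled, k, delta, eps):
    """Turn a k-additive algorithm into an eps-approximation of the value.

    additive_on_scaled(theta) returns v with theta*mu - k <= v <= theta*mu.
    """
    theta0 = math.floor(Fraction(k) / delta) + 1
    if additive_on_scaled(theta0) <= 0:  # test from Theorem 3.1: the value is 0
        return Fraction(0)
    theta = math.ceil(Fraction(k) / (Fraction(eps) * delta))  # theta >= k / (eps * delta)
    return Fraction(additive_on_scaled(theta)) / theta


def additive_from_ptas(ptas, k):
    """Turn a PTAS into a k-additive approximation of the value.

    ptas(eps) returns v with (1 - eps) * mu <= v <= mu.
    """
    v = ptas(Fraction(1, 2))  # mu >= v >= mu / 2
    if v == 0:
        return Fraction(0)
    eps = min(Fraction(k) / (2 * v), Fraction(1, 2))  # eps <= k / (2v) <= k / mu
    return ptas(eps)


mu, k, delta = Fraction(7, 3), 2, Fraction(1, 8)
worst_additive = lambda theta: theta * mu - k
worst_ptas = lambda eps: (1 - eps) * mu

approx = ptas_from_additive(worst_additive, k, delta, Fraction(1, 10))
additive = additive_from_ptas(worst_ptas, k)
```

Only the rescaling and the lower bound δ are used, which is why the same argument applies to any optimization problem with those two properties.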
A problem that is not ε-approximable for some ε cannot have a PTAS. Therefore,
any multiplicative nonapproximability result yields an additive nonapproximability result.
However, an additive nonapproximability result only shows that there is no PTAS, although
there might be an ε-approximation for some fixed ε.

4. Non-Approximability for Finite-Horizon POMDPs 

This section focuses on finite-horizon policies. Because that is consistent throughout the 
section, we do not explicitly mention it in each theorem. However, as Section 6 shows, there 
are significant computational differences between finite- and infinite-horizon calculations. 

The policy existence problem for POMDPs with negative and non-negative rewards
is not suited for ε-approximation. If a policy with positive performance exists, then every
approximation algorithm yields such a policy, because a policy with performance 0 or smaller
cannot approximate a policy with positive performance. Hence, the decision problem is
straightforwardly solved by any ε-approximation. Therefore, we concentrate on POMDPs
with non-negative rewards. Results for POMDPs with unrestricted rewards are stated as
corollaries. Consider an ε-approximation algorithm A that, on input a POMDP M with
non-negative rewards, outputs a policy $\pi^A$ of type α. Then it holds that

$\mathrm{perf}(M, \pi^A) \geq (1 - \varepsilon) \cdot \mathrm{val}_\alpha(M).$

We first consider the question of whether an optimal stationary policy can be ε-approximated
for POMDPs with non-negative rewards. It is known (Littman, 1994; Mundhenk
et al., 2000) that the related decision problem is NP-complete. We include a sketch of that
proof here, since later proofs build on it. The formal details can be found in Appendix A.

Theorem 4.1 (Littman, 1994; Mundhenk et al., 2000) The stationary policy existence 
problem for POMDPs with non-negative rewards is NP-complete. 

Proof Membership in NP is straightforward, because a policy can be guessed and evaluated
in polynomial time. To show NP-hardness, we reduce the NP-complete satisfiability problem
3Sat to it. Let $\varphi(x_1, \ldots, x_n)$ be such a formula with variables $x_1, \ldots, x_n$ and clauses
$C_1, \ldots, C_m$, where clause $C_j = (l_{v(1,j)} \vee l_{v(2,j)} \vee l_{v(3,j)})$ for $l_i \in \{x_i, \neg x_i\}$. We say that
variable $x_i$ appears in $C_j$ with signum 0 (resp. 1) if $\neg x_i$ (resp. $x_i$) is a literal in $C_j$. Without
loss of generality, we assume that every variable appears at most once in each clause. The
idea is to construct a POMDP M(φ) having one state for each appearance of a variable
in a clause. The set of observations is the set of variables. Each action corresponds to an
assignment of a value to a variable. The transition function is deterministic. The process
starts with the first variable in the first clause. If the action chosen in a certain state satisfies
the corresponding literal, the process proceeds to the first variable of the next clause, or
with reward 1 to a final sink state T if all clauses were considered. If the action does not
satisfy the literal, the process proceeds to the next variable of the clause, or with reward
0 to a sink state F. A sink state will never be left. The partition of the state space into
observation classes guarantees that the same assignment is made for every appearance of
the same variable. Therefore, the value of M(φ) equals 1 iff φ is satisfiable. The formal
reduction is in Appendix A. □
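
The construction sketched in this proof is small enough to simulate directly. The code below is an illustrative sketch, not the formal reduction of Appendix A: literals are encoded as nonzero integers (v for x_v, -v for its negation), states are pairs (clause index, position in clause), a stationary policy is a truth assignment indexed by variable, and the value is found by brute force over all assignments. All names are hypothetical.

```python
from itertools import product


def build_pomdp(clauses):
    """Return (step, obs, start) for the POMDP built from a 3-CNF formula."""
    m = len(clauses)

    def obs(state):
        j, p = state
        return abs(clauses[j][p])  # the observation is the variable, not its sign

    def step(state, action):
        """Deterministic transition; returns (next state, reward)."""
        if state in ("T", "F"):
            return state, 0  # a sink state is never left
        j, p = state
        literal = clauses[j][p]
        satisfied = (action == 1) == (literal > 0)
        if satisfied:
            return ((j + 1, 0), 0) if j + 1 < m else ("T", 1)
        return ((j, p + 1), 0) if p + 1 < 3 else ("F", 0)

    return step, obs, (0, 0)


def performance(clauses, assignment):
    """Total reward of the stationary policy given by a truth assignment."""
    step, obs, state = build_pomdp(clauses)
    total = 0
    for _ in range(3 * len(clauses) + 1):  # enough steps to reach a sink
        if state in ("T", "F"):
            break
        state, reward = step(state, assignment[obs(state)])
        total += reward
    return total


def stationary_value(clauses, n_vars):
    """Brute-force maximum over all 2^n stationary policies (exponential time)."""
    return max(
        performance(clauses, dict(zip(range(1, n_vars + 1), bits)))
        for bits in product((0, 1), repeat=n_vars)
    )


satisfiable = [(1, 2, 3), (-1, -2, -3)]
unsatisfiable = [(a * 1, b * 2, c * 3) for a, b, c in product((1, -1), repeat=3)]
```

Here `stationary_value(satisfiable, 3)` is 1 and `stationary_value(unsatisfiable, 3)` is 0, matching the claim that the value is 1 exactly when the formula is satisfiable.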
Note that all policies have expected reward of either 1 or 0. Immediately we get the
nonapproximability result for POMDPs, even if all trajectories have non-negative perfor- 
mance. 

Theorem 4.2 Let 0 ≤ ε < 1. An optimal stationary policy for POMDPs with non-negative
rewards is ε-approximable if and only if P = NP.

Proof The stationary value of a POMDP can be calculated in polynomial time by a 
binary search using an oracle for the stationary policy existence problem for POMDPs. 
The number of bits to be calculated is polynomial in the size of M. Knowing the value, we 
can try to fix an action for an observation. If the modified POMDP still achieves the value 
calculated before, we can continue with the next observation, until a stationary policy is 
found which has the optimal performance. This algorithm runs in polynomial time with an 
oracle solving the stationary policy existence problem for POMDPs. Since the oracle is in 
NP, by Theorem 4.1, the algorithm runs in polynomial time if P = NP. 

Now, assume that $A$ is a polynomial-time algorithm that $\varepsilon$-approximates the optimal
stationary policy for some $\varepsilon$ with $0 < \varepsilon < 1$. We show that this implies that P = NP by
showing how to solve the NP-complete problem 3Sat. As in the proof of Theorem 4.1,
given an instance $\phi$ of 3Sat, we construct a POMDP $M(\phi)$. The only change to the reward
function of the POMDP constructed in the proof of Theorem 4.1 is to make it a POMDP
with positive performances. Now reward 1 is obtained if state F is reached, and reward
$\lceil 2/(1-\varepsilon) \rceil$ is obtained if state T is reached. Hence $\phi$ is satisfiable if and only if $M(\phi)$ has value
$\lceil 2/(1-\varepsilon) \rceil$.

Assume that policy $\pi$ is the output of the $\varepsilon$-approximation algorithm $A$. If $\phi$ is satisfiable,
then $\mathit{perf}(M(\phi), \pi) \geq (1-\varepsilon) \cdot \lceil 2/(1-\varepsilon) \rceil \geq 2 > 1$. Because the performance of every policy for
$M(\phi)$ is either 1 if $\phi$ is not satisfiable, or $\lceil 2/(1-\varepsilon) \rceil$ if $\phi$ is satisfiable, it follows that $\pi$ has
performance $> 1$ if and only if $\phi$ is satisfiable. So, in order to decide $\phi \in$ 3Sat, we can
construct $M(\phi)$, run the approximation algorithm $A$ on it, take its output $\pi$ and calculate
$\mathit{perf}(M(\phi), \pi)$. That output shows whether $\phi$ is in 3Sat. All these steps are polynomial-time
bounded computations. It follows that 3Sat is in P, and hence P = NP. □
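
The decision procedure at the end of this proof can be sketched as follows. The sketch assumes that the reward at state T is $\lceil 2/(1-\varepsilon) \rceil$ (the fraction is damaged in the source, so this is a reading, not a quotation) and uses a brute-force routine as a stand-in for the hypothetical approximation algorithm $A$. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import ceil

def perf(clauses, policy, reward_T):
    """Performance in the modified POMDP: reward_T at state T, 1 at state F."""
    satisfied = all(
        any(policy[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )
    return reward_T if satisfied else 1

def stand_in_approximation(clauses, reward_T):
    """Placeholder for algorithm A: returns an optimal policy by brute force,
    which is trivially an epsilon-approximation."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    candidates = (
        dict(zip(variables, bits))
        for bits in product([False, True], repeat=len(variables))
    )
    return max(candidates, key=lambda p: perf(clauses, p, reward_T))

def decide_3sat(clauses, eps, approximate):
    """phi is satisfiable iff the returned policy has performance > 1."""
    reward_T = ceil(2 / (1 - eps))  # assumed reward at T
    policy = approximate(clauses, reward_T)
    return perf(clauses, policy, reward_T) > 1

eps = Fraction(1, 2)
```

Any policy within a factor $(1-\varepsilon)$ of the optimum would do in place of the brute-force stand-in; the comparison against 1 is what separates the two cases.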

Of course, the same nonapproximability result holds for POMDPs with positive and 
negative rewards. 

Corollary 4.3 Let $0 < \varepsilon < 1$. Any optimal stationary policy for POMDPs is
$\varepsilon$-approximable if and only if P = NP.

Using the same proof technique as above, we can show that the value is nonapproxi- 
mable, too. 

Corollary 4.4 Let $0 < \varepsilon < 1$. The stationary value for POMDPs is $\varepsilon$-approximable if and
only if P = NP.

A similar argument can be used to show that a policy with performance at least the 
average of all performances for a POMDP cannot be computed in polynomial time, unless 
P = NP. Note that in the proof of Theorem 4.1, the only performance greater than or equal 
to the average of all performances is that of an optimal policy.

Corollary 4.5 The following are equivalent.

1. There exists a polynomial-time algorithm that for a given POMDP M computes a 
stationary policy under which M has performance greater than or equal to the average 
stationary performance of M. 

2. P = NP. 

Thus, even calculating a policy whose performance is above average is likely to be infeasible. 

We now turn to time-dependent policies. The time-dependent policy existence problem 
for POMDPs is known to be NP-complete, as is the stationary one. 

Theorem 4.6 (Mundhenk et al., 2000) The time-dependent policy existence problem for
unobservable MDPs is NP-complete.

Papadimitriou and Tsitsiklis (1987) proved a theorem similar to Theorem 4.6. Their 
MDPs had only non-positive rewards, and their formulation of the decision problem was
whether there is a policy with performance 0. The proof by Mundhenk et al. (2000), like
theirs, uses a reduction from 3Sat. We modify this reduction to show that an optimal 
time-dependent policy is hard to approximate even for unobservable MDPs. 

Theorem 4.7 Let $0 < \varepsilon < 1$. Any optimal time-dependent policy for unobservable MDPs
with non-negative rewards is $\varepsilon$-approximable if and only if P = NP.

Proof We give a reduction from 3Sat with the following properties. For a formula $\phi$
with $m$ clauses we show how to construct an unobservable MDP $M_\varepsilon(\phi)$ with value 1 if $\phi$ is
satisfiable, and with value $< (1 - \varepsilon)$ if $\phi$ is not satisfiable. Therefore, an $\varepsilon$-approximation
could be used to distinguish between satisfiable and unsatisfiable formulas in polynomial
time.

For formula $\phi$, we first show how to construct an unobservable MDP $M(\phi)$ from which $M_\varepsilon(\phi)$
will be constructed. (The formal presentation appears in Appendix B.) $M(\phi)$ simulates the
following strategy. At the first step, one of the $m$ clauses is chosen uniformly at random,
each with probability $1/m$. At step $i + 1$, the assignment of variable $i$ is determined. Because the
process is unobservable, it is guaranteed that each variable gets the same assignment in all
clauses, because its value is determined in the same step. If a clause is satisfied by this
assignment, a final state will be reached. If not, an error state will be reached.

Now, construct $M_\varepsilon(\phi)$ from copies $M_1, \ldots, M_{m^2}$ of $M(\phi)$, such that the initial state
of $M_\varepsilon(\phi)$ is the initial state of $M_1$, the initial state of $M_{j+1}$ is the final state T of $M_j$, and
reward 1 is gained if the final state of $M_{m^2}$ is reached. The error states of all the $M_j$ are
identified as a unique sink state F.

To illustrate the construction, in Figure 1 we give an example POMDP consisting of a
chain of 4 copies of $M(\phi)$ obtained for the formula $\phi = (\neg x_1 \vee x_3 \vee x_4) \wedge (x_1 \vee \neg x_2 \vee x_4)$.
The dashed arrows indicate a transition with probability $1/2$. The dotted (resp. solid) arrows
are probability 1 transitions on action 0 (resp. 1). The actions correspond to assignments
to the variables.

If $\phi$ is satisfiable, then a time-dependent policy simulating $m^2$ repetitions of any
satisfying assignment has performance 1. If $\phi$ is not satisfiable, then under any assignment at
least one of the $m$ clauses of $\phi$ is not satisfied. Hence, the probability that under any
time-dependent policy the final state T of $M(\phi)$ is reached is at most $1 - 1/m$. Consequently, the
probability that the final state of $M_\varepsilon(\phi)$ is reached is at most $(1 - 1/m)^{m^2} \leq e^{-m}$. This
probability equals the expected reward. Since for large enough $m$ it holds that $e^{-m} < (1 - \varepsilon)$,
the theorem follows. □

Figure 1: An example unobservable MDP for $\phi = (\neg x_1 \vee x_3 \vee x_4) \wedge (x_1 \vee \neg x_2 \vee x_4)$
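
The amplification step can be checked numerically on small formulas. The sketch below is an illustration under stated assumptions: it takes the chain to have $m^2$ copies, as read from the damaged exponent in the source, and it uses the fact that the copies are traversed independently, so the best reachable probability is the best per-copy success probability raised to the power $m^2$. The name `chain_value` and the clause encoding are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import exp

def chain_value(clauses):
    """Value of the chain of m*m copies of M(phi), by brute force.

    In each copy a clause is drawn uniformly at random, so an assignment
    that satisfies s of the m clauses passes a copy with probability s/m.
    """
    m = len(clauses)
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    best = max(
        Fraction(
            sum(any(a[abs(lit)] == (lit > 0) for lit in c) for c in clauses), m
        )
        for a in (
            dict(zip(variables, bits))
            for bits in product([False, True], repeat=len(variables))
        )
    )
    return best ** (m * m)

unsat = [[1, 2], [1, -2], [-1, 2], [-1, -2]]  # every assignment misses a clause
sat = [[1, 2], [-1, 2]]
```

For the unsatisfiable example with $m = 4$ the value is $(3/4)^{16}$, which lies below $e^{-4}$.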

Note that the time-dependent policy existence problem for POMDPs with non-negative 
rewards is NL-complete (Mundhenk et al., 2000). The class NL consists of those languages 
recognizable by nondeterministic Turing machines that use a read-only input tape and 
additional read-write tapes with $O(\log n)$ tape cells. It is known that NL $\subseteq$ P and that NL
is properly contained in PSPACE. Unlike the case of stationary policies, approximability
of time-dependent policies is harder than the policy existence problem (unless NL = NP). 

Unobservability is a special case of partial observability. Hence, we get the same non- 
approximability result for POMDPs, even for unrestricted rewards. 

Corollary 4.8 Let $0 < \varepsilon < 1$. Any optimal time-dependent policy for POMDPs is
$\varepsilon$-approximable if and only if P = NP.

Corollary 4.9 Let $0 < \varepsilon < 1$. The time-dependent value of POMDPs is $\varepsilon$-approximable if
and only if P = NP.

Note that the proof of Theorem 4.7 assumed a total expected reward criterion. The
discounted reward criterion is also useful in the finite horizon. To show the result for a
discounted reward criterion, we only need to change the reward in the proof of Theorem 4.7
as follows: Multiply the final reward by $\beta^{-m^2 (n+1)}$, where $\beta$ is the discount factor, $m$ the
number of clauses, and $n$ the number of variables of the formula $\phi$.

Papadimitriou and Tsitsiklis (1987) proved that a problem very similar to history- 
dependent policy existence is PSPACE-complete. 

Theorem 4.10 (Papadimitriou & Tsitsiklis, 1987; Mundhenk et al., 2000) The
history-dependent policy existence problem for POMDPs is PSPACE-complete.

To describe a horizon $n$ history-dependent policy for a POMDP with $c$ observations
explicitly takes space $c^n$. (We do not address the case of succinctly represented policies
for POMDPs here. For an analysis of their complexity, see Mundhenk, 2000a.) If $c > 1$, this
is exponential space. Therefore, we cannot expect that a polynomial-time algorithm outputs
a history-dependent policy, and we restrict consideration to polynomial-time algorithms that 
approximate the history-dependent value — the optimal performance under any history- 
dependent policy — of a POMDP. Burago, de Rougemont, and Slissenko (1996) considered 
the class of POMDPs with a bound of q on the number of states corresponding to an 
observation, where the rewards corresponded to the probability of reaching a fixed set of goal 
states (and thus were bounded by 1). They showed that for any fixed q, the optimal history- 
dependent policies for POMDPs in this class can be approximated to within an additive 
constant k. We showed in Proposition 3.2 that POMDP history-dependent discounted or
total-reward value problems that can be approximated to within an additive constant k
have polynomial-time approximation schemes, as long as there are no a
priori bounds on either the number of states per observation or the rewards.

Notice, however, that Theorem 4.11 does not give us information about the classes of 
POMDPs that Burago et al. (1996) considered: Because of the restrictions associated with 
the parameter q, our hardness results do not contradict their result.

Finally, we show that the history-dependent value of POMDPs with non-negative
rewards is not $\varepsilon$-approximable under total expected or discounted rewards, unless P =
PSPACE. Consequently, the value has no PTAS or k-additive approximation under the
same assumption.

The history-dependent policy existence problem for POMDPs with non-negative re- 
wards is NL-complete (Mundhenk et al., 2000). Hence, because NL is a proper subclass of 
PSPACE, approximability of the history-dependent value is strictly harder than the policy 
existence problem. 

Theorem 4.11 Let $0 < \varepsilon < 1$. The history-dependent value of POMDPs with non-negative
rewards is $\varepsilon$-approximable if and only if P = PSPACE.

Proof The history-dependent value of a POMDP M can be calculated using binary search 
over the history-dependent policy existence problem. The number of bits to be calculated is 
polynomial in the size of M. Therefore, by Theorem 4.10, this calculation can be performed 
in polynomial time using a PSPACE oracle. If P = PSPACE, it follows that the history- 
dependent value of a POMDP M can be exactly calculated in polynomial time. 

The set QSAT of true quantified Boolean formulae is one of the standard PSPACE-
complete sets. To conclude P = PSPACE from an $\varepsilon$-approximation of the history-dependent
value problem, we use a transformation of instances of QSAT to POMDPs similar to the
proof of Theorem 4.10 in (Mundhenk, 2000b).

The set QSAT can be interpreted as a two-player game: Player 1 sets the existentially
quantified variables, and player 2 sets the universally quantified variables. Player 1 wins if 
the alternating choices determine a satisfying assignment to the formula, and player 2 wins 
if the determined assignment is not satisfying. A formula is in QSAT if and only if player 1 
has a winning strategy. This means player 1 has a response to every choice of player 2, so 
that in the end the formula will be satisfied. 

The version where player 2 makes random choices and player 1's goal is to win with
probability $> 1/2$ corresponds to SSAT (stochastic satisfiability), which is also
PSPACE-complete. The instances of SSAT are formulas which are quantified alternatingly with existential
quantifiers $\exists$ and random quantifiers $R$. The meaning of the random quantifier $R$ is that
an assignment to the respective variable is chosen uniformly at random from $\{0, 1\}$. A
stochastic Boolean formula

$\Phi = \exists x_1 R x_2 \exists x_3 R x_4 \ldots \phi$

is in SSAT if and only if

there exists $b_1$ for random $b_2$ exists $b_3$ for random $b_4$ $\ldots$ $\mathrm{Prob}[\phi(b_1, \ldots, b_n) \text{ is true}] > 1/2$.

If $\Phi$ has $r$ random quantifiers, then the strategy of player 1 determines a set of $2^r$
assignments to $\phi$. The term "$\mathrm{Prob}[\phi(b_1, \ldots, b_n) \text{ is true}] > 1/2$" means that more than $2^{r-1}$
of these $2^r$ assignments satisfy $\phi$.
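
The semantics of the two quantifiers can be made concrete with a small evaluator: player 1 maximizes at an existential quantifier, and a random quantifier averages over both values. This is an exponential-time illustration, not an algorithm from the paper; the prefix encoding (`'E'` for $\exists$, `'R'` for the random quantifier) and the function names are assumptions made for the sketch.

```python
from fractions import Fraction

def ssat_value(prefix, formula, assignment=()):
    """Best winning probability of player 1.

    prefix:  string over 'E' (existential) and 'R' (random), one per variable.
    formula: function from a tuple of booleans to a boolean.
    """
    i = len(assignment)
    if i == len(prefix):
        return Fraction(int(formula(assignment)))
    branches = [
        ssat_value(prefix, formula, assignment + (b,)) for b in (False, True)
    ]
    if prefix[i] == "E":
        return max(branches)   # player 1 picks the better value
    return sum(branches) / 2   # fair coin decides the value

def in_ssat(prefix, formula):
    """Membership test: winning probability strictly above 1/2."""
    return ssat_value(prefix, formula) > Fraction(1, 2)
```

For example, `ssat_value("ER", lambda b: b[0] == b[1])` is exactly 1/2, so that formula is not in SSAT under the strict threshold.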

From the proof of IP = PSPACE by Shamir (1992) it follows that for every PSPACE
set $A$ and constant $c > 1$ there is a polynomial-time reduction $f$ from $A$ to SSAT such that
for every instance $x$ and formula $f(x) = \exists x_1 R x_2 \cdots \phi_x$ the following holds.

• If $x \in A$, then $\exists b_1$ for random $b_2$ $\cdots$ $\mathrm{Prob}[\phi_x(b_1, \ldots, b_n) \text{ is true}] > (1 - 2^{-c})$, and

• If $x \notin A$, then $\forall b_1$ for random $b_2$ $\cdots$ $\mathrm{Prob}[\phi_x(b_1, \ldots, b_n) \text{ is true}] < 2^{-c}$.

This means that player 1 either has a strategy under which she wins with very high prob- 
ability, or the probability of winning (under any strategy) is very small. We show how to 
transform a stochastic Boolean formula $\Phi$ into a POMDP with a large history-dependent
value if player 1 has a winning strategy, and a much smaller value if player 2 wins. 

For an instance $\Phi = \exists x_1 R x_2 \cdots \phi$ of SSAT, where $\phi$ is a formula with $n$ variables
$x_1, \ldots, x_n$, we construct a POMDP $M(\Phi)$ as follows. The role of player 1 is taken by
the controller of the process. A strategy of player 1 determines a policy of the controller,
and vice versa. Player 2 appears as probabilistic transitions in the process. The process
$M(\Phi)$ has three stages. The first stage consists of one step. The process chooses uniformly
at random one of the variables and an assignment to it, and stores the variable and the
assignment. More formally, from the initial state $s_0$, one of the states "$x_i = b$" ($1 \leq i \leq n$,
$b \in \{0,1\}$) is reached, each with probability $1/(2n)$. It is not observable which variable
assignment was stored by the process. However, whenever that variable appears later, the
process checks that the initially fixed assignment is chosen again. If the policy gives a
different assignment during the second stage, the process halts with reward 0. (There is
a deterministic transition to a final state which we refer to as $s_{end}$, or less formally, the
dead end state.) If such an inconsistency occurs during the third stage, the process halts
with reward 0 and notices that the policy cheats. (There is a deterministic transition to a
sink state which we refer to as $s_{cheat}$, or less formally as the penalty box because the player
sent there cannot re-enter the game later.) If eventually the whole formula is passed, either
reward 2 or reward 0 is obtained, depending on whether the formula was satisfied or not.
The first stage is sketched in Figure 2. In this and the following figures, dashed arrows
represent random transitions (all of equal probability, regardless of the action chosen),
solid arrows represent deterministic transitions corresponding to the action 1 (True), and
dotted arrows represent deterministic transitions corresponding to the action 0 (False).

Figure 2: The first stage of $M(\Phi)$.

The second stage starts in each of the states "$x_i = b$" and has $n$ steps, during which an
assignment to the variables $x_1, x_2, \ldots, x_n$ is fixed. Let $A_{c,b}$ denote the part of the process'
second stage during which it is assumed that value $b$ is assigned to variable $x_c$. If a variable
$x_i$ is existentially quantified, then the assignment is the action in $\{0, 1\}$ chosen by the policy.
If a variable $x_i$ is randomly quantified, then the assignment is chosen uniformly at random
by the process, independent of the action of the policy. In the second stage, it is observable
which assignment was made to every variable. If the variable assignment from the first
stage does not coincide with the assignment made to that variable during the second stage,
the trajectory on which that happens ends in the dead end state that yields reward 0. Let
$r$ be the number of random quantifiers of $\Phi$. Every strategy of player 1 determines $2^r$
assignments. Every assignment $(x_1 = b_1, \ldots, x_n = b_n)$ induces $2n$ trajectories: $n$ have the
form

$s_0,\ x_i = b_i,\ [x_1, b_1], \ldots, [x_i, b_i], \ldots, [x_n, b_n]$

(for $i = 1, 2, \ldots, n$) that pass stage 2 without reaching the dead end state and continue with
the first state of the third stage, and $n$ that dead-end in stage 2. The latter $n$ trajectories
that do not reach stage 3 are of the form

$s_0,\ x_i = b_i,\ [x_1, b_1], \ldots, [x_i, 1 - b_i],\ s_{end}$

(for $i = 1, 2, \ldots, n$). Accordingly, $M(\Phi)$ has $n \cdot 2^r$ trajectories that reach the third stage.
The structure of stage 2 is sketched for $x_3 = 0$ in Figure 3.

The third stage checks whether $\phi$ is satisfied by that trajectory's assignment. The
process passes sequentially through the whole formula checking each literal in each clause for
an assignment to the respective variable.³ The case of a cheating policy, i.e., one that answers
during the third stage with another assignment than fixed during the second stage, must be
"punished". Whenever the variable corresponding to the initial, stored assignment appears,
the process checks that the stored assignment is consistent with the current assignment. If
eventually the whole formula passes the checking, either reward 2 or reward 0 is obtained,
depending on whether the formula was satisfied and the policy was not cheating, or not.
Let $C_{c,b}$ be that instance of the third stage where it is checked whether $x_c$ always gets
assignment $b$. It is essentially the same deterministic process as defined in the proof of
Theorem 4.1, but whenever an assignment to a literal containing $x_c$ is asked for, if $x_c$ does
not get assignment $b$ the process goes to state $s_{cheat}$. Otherwise, the process goes to state
$s_{end}$. If the assignment chosen by the policy satisfied the formula, reward 2 is obtained;
otherwise the reward is 0.

Figure 3: The second stage of $M(\Phi)$: $A_{3,0}$ for the quantifier prefix $R x_1 \exists x_2 R x_3 \exists x_4$.

The overall structure of M($) is sketched in Figure 4. Note that the dashed arrows 
represent random transitions (all of equal probability, regardless of the action chosen),
solid arrows represent deterministic transitions corresponding to the action True, dotted 
arrows represent deterministic transitions corresponding to the action False, and dot-dash 
arrows represent transitions that are forced, whatever the choice of action. 

Consider a formula $\Phi \in$ SSAT with variables $x_1, x_2, \ldots, x_n$ and $r$ random quantifiers, and
consider $M(\Phi)$. Because the third stage is deterministic, the process has $2n \cdot 2^r$ trajectories,
$n \cdot 2^r$ of which reach stage 3. Now, assume that $\pi$ is a policy which is consistent with the
observations from the $n$ steps during the second stage, i.e., whenever it is "asked" to give an
assignment to a variable (during the third stage), it does so according to the observations
during the second stage and therefore it assigns the same value to every appearance of a
variable in $C_{k,a}$. Because $\Phi \in$ SSAT, a fraction of more than $1 - 2^{-c}$ of the trajectories that
reach stage 3 correspond to a satisfying assignment and are continued under this policy $\pi$
to state $s_{end}$ where they receive reward 2. Hence, the history-dependent value of $M(\Phi)$ is
$> \frac{1}{2} \cdot (1 - 2^{-c}) \cdot 2 = 1 - 2^{-c}$.

3. We can also regard the interaction between the process and the policy as an interactive proof system,
where the policy presents a proof and the process checks its correctness.

Figure 4: A sketch of $M(\Phi)$.

For $\Phi \notin$ SSAT, an inconsistent (or cheating) policy on $M(\Phi)$ may have performance
greater than $1 - 2^{-c}$. Therefore, we have to perform a probability amplification as in the
proof of Theorem 4.7 that punishes cheating. We construct $M_k(\Phi)$ from $k$ copies $M_1, \ldots, M_k$
of $M(\Phi)$ (the exact value of $k$ will be determined later), such that the initial state of $M_k(\Phi)$
is the initial state of $M_1$, and the initial state of $M_{j+1}$ is the final state $s_{end}$ of $M_j$. If in
some repetition a trajectory is caught cheating, then it is sent to the "penalty box" and is
not continued in the following repetitions. Hence, it cannot collect any more rewards. More
formally, the states $s_{cheat}$ of all the $M_j$ are identified as a unique sink state of $M_k(\Phi)$.

If $\Phi \in$ SSAT, then in each round (or repetition), expected rewards $> 1 - 2^{-c}$ can be
collected, and hence the value of $M_k(\Phi)$ is $> k \cdot (1 - 2^{-c})$.

Consider a formula $\Phi \notin$ SSAT. Then a non-cheating policy for $M_k(\Phi)$ has performance
less than $k \cdot 2^{-c}$. Cheating policies may have better performances. We claim that for all $k$,
the value of $M_k(\Phi)$ is at most $k \cdot 2^{-c} + 2n$. The proof is an induction on $k$. Consider $M_1(\Phi)$,
which has the same value as $M(\Phi)$. Hence, the value of $M_1(\Phi)$ is at most 1. As an inductive
hypothesis, let us assume that $M_k(\Phi)$ has value at most $k \cdot 2^{-c} + 2n$. In the inductive step,
we consider $M_{k+1}(\Phi)$, i.e. $M(\Phi)$ followed by $M_k(\Phi)$. Assume that a policy $\pi_j$ cheats on $j$
of the $2^r$ assignments. From the $n$ trajectories that correspond to an assignment, at least
1 is trapped for cheating under a cheating policy, and at most $n - 1$ may obtain reward 2.
Then the reward obtained in the first round is at most $2^{-c} + 2 \cdot \frac{j(n-1)}{2n \cdot 2^r}$, and the rewards
obtained in the following rounds are multiplied by $1 - \frac{j}{2n \cdot 2^r}$, because a fraction of $\frac{j}{2n \cdot 2^r}$ of
the trajectories are sent to the penalty box. Using the induction hypothesis, we obtain the
following upper bound for the performance of $M_{k+1}(\Phi)$ under $\pi_j$ for an arbitrary $j$.

$$
\begin{aligned}
\mathit{perf}(M_{k+1}(\Phi), \pi_j)
 &\leq \Bigl(2^{-c} + \frac{j(n-1)}{n \cdot 2^r}\Bigr) + \Bigl(1 - \frac{j}{2n \cdot 2^r}\Bigr) \cdot \mathit{val}(M_k(\Phi)) \\
 &\leq \Bigl(2^{-c} + \frac{j(n-1)}{n \cdot 2^r}\Bigr) + \Bigl(1 - \frac{j}{2n \cdot 2^r}\Bigr) \cdot (k \cdot 2^{-c} + 2n) \\
 &= (k+1) \cdot 2^{-c} + 2n - \frac{j}{2^r} \Bigl(\frac{1}{n} + \frac{k \cdot 2^{-c}}{2n}\Bigr) \\
 &\leq (k+1) \cdot 2^{-c} + 2n
\end{aligned}
$$

This completes the induction step. Hence, we proved that, for $\Phi \notin$ SSAT and for every $k$,
the value of $M_k(\Phi)$ is at most $k \cdot 2^{-c} + 2n$.
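
The induction step can be sanity-checked with exact rational arithmetic. The sketch below assumes the reading of the damaged display in which the first-round reward is at most $2^{-c} + j(n-1)/(n \cdot 2^r)$ and the surviving fraction is $1 - j/(2n \cdot 2^r)$; the function names are hypothetical.

```python
from fractions import Fraction

def step_bound(n, r, j, k, c):
    """Upper bound on perf(M_{k+1}, pi_j), given val(M_k) <= k*2^-c + 2n."""
    two_to_minus_c = Fraction(1, 2 ** c)
    first_round = two_to_minus_c + Fraction(j * (n - 1), n * 2 ** r)
    surviving = 1 - Fraction(j, 2 * n * 2 ** r)
    return first_round + surviving * (k * two_to_minus_c + 2 * n)

def claimed_bound(n, k, c):
    """The bound asserted for M_{k+1}: (k+1)*2^-c + 2n."""
    return (k + 1) * Fraction(1, 2 ** c) + 2 * n

def step_holds(n, r, j, k, c):
    return step_bound(n, r, j, k, c) <= claimed_bound(n, k, c)
```

With $j = 0$ (no cheating) the two bounds coincide; every cheated assignment only lowers the left-hand side.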

Eventually, we have to fix the constants. We choose $c$ such that $2^c > \frac{2-\varepsilon}{1-\varepsilon}$. This
guarantees that

$(1 - \varepsilon) \cdot (1 - 2^{-c}) - 2^{-c} > 0$.

Next, we choose $k$ such that

$2n < k \cdot \bigl((1 - \varepsilon) \cdot (1 - 2^{-c}) - 2^{-c}\bigr)$.

Let $M_k(\Phi)$ be the POMDP that consists of $k$ repetitions of $M(\Phi)$ as described above.
Because $k$ is linear in the number, $n$, of variables of $\Phi$ and hence linear in the length of $\Phi$,
$\Phi$ can be transformed to $M_k(\Phi)$ in polynomial time. The above estimates guarantee that

value of $M_k(\Phi)$ for $\Phi \notin$ SSAT $\leq k \cdot 2^{-c} + 2n < (1 - \varepsilon) \cdot k \cdot (1 - 2^{-c})$.

The right-hand side of this inequality is a lower bound for an $\varepsilon$-approximation of the value
of $M_k(\Phi)$ for $\Phi \in$ SSAT. Hence,

• if $\Phi \in$ SSAT, then $M_k(\Phi)$ has value $\geq k \cdot (1 - 2^{-c})$, and

• if $\Phi \notin$ SSAT, then $M_k(\Phi)$ has value $< (1 - \varepsilon) \cdot k \cdot (1 - 2^{-c})$.

Hence, a polynomial-time $\varepsilon$-approximation of the value of $M_k(\Phi)$ shows whether $\Phi$ is in
SSAT.
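
The choice of constants can be made explicit. The sketch below assumes the threshold $2^c > (2-\varepsilon)/(1-\varepsilon)$, which is derived here from the positivity condition because the source is damaged at that point, and then picks the smallest admissible $k$. The name `choose_constants` is hypothetical.

```python
from fractions import Fraction
from math import floor

def choose_constants(eps, n):
    """Return (c, k) with (1-eps)(1-2^-c) - 2^-c > 0 and 2n < k * that gap.

    eps: Fraction with 0 < eps < 1; n: number of variables.
    """
    c = 1
    while Fraction(2 ** c) <= (2 - eps) / (1 - eps):
        c += 1
    gap = (1 - eps) * (1 - Fraction(1, 2 ** c)) - Fraction(1, 2 ** c)
    k = floor(2 * n / gap) + 1  # smallest integer k with 2n < k * gap
    return c, k

def separates(eps, n):
    """Check k*2^-c + 2n < (1-eps)*k*(1-2^-c) for the chosen constants."""
    c, k = choose_constants(eps, n)
    small = Fraction(1, 2 ** c)
    return k * small + 2 * n < (1 - eps) * k * (1 - small)
```

Since the gap depends only on $\varepsilon$, the resulting $k$ grows linearly in $n$, consistent with the polynomial-time claim.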

Concluding, let $A$ be any set in PSPACE. There exists a polynomial-time function $f$
which maps every instance $x$ of $A$ to a bounded error stochastic formula $f(x) = \Phi_x$ with
error $2^{-c}$ and reduces $A$ to SSAT. Transform $\Phi_x$ into the POMDP $M_k(\Phi_x)$. Using the
$\varepsilon$-approximate value of $M_k(\Phi_x)$, one can answer "$\Phi_x \in$ SSAT?" and hence "$x \in A$?" in polynomial
time. This shows that $A$ is in P, and consequently P = PSPACE. □

Corollary 4.12 Let $0 < \varepsilon < 1$. The history-dependent value of POMDPs with general
rewards is $\varepsilon$-approximable if and only if P = PSPACE.

5. MDPs

Calculating the finite-horizon performance of stationary policies is in GapL (Mundhenk 
et al., 2000), which is a subclass of the class of polynomial time computable functions. The 
stationary policy existence problem for MDPs is shown to be P-hard by Papadimitriou and 
Tsitsiklis (1986), from which it follows that finding an optimal stationary policy for MDPs 
is P-hard. So it is not surprising that approximating the optimal policy is also P-hard. We 
include the following theorem because it allows us to present one aspect of the reduction 
used in the proof of Theorem 5.2 in isolation. 

Theorem 5.1 The problem of k-additive approximating the optimal stationary policy for 
MDPs is P-hard. 

The proof shows this for the case of non-negative rewards; the unrestricted case follows 
immediately. By Proposition 3.2, this shows that finding a multiplicative approximation 
scheme for this problem is also P-hard. 

Proof Consider the P-complete problem CVP: given a Boolean circuit $C$ and input $x$, is
$C(x) = 1$? A Boolean circuit and its input can be seen as a directed acyclic graph. Each
node represents a gate, and every gate has one of the types AND, OR, NOT, 0, or 1. The
gates of type 0 or 1 are the input gates, which represent the bits of the fixed input $x$ to the
circuit. Input gates have indegree 0. All NOT gates have indegree 1, and all AND and OR
gates have indegree 2. There is one gate having outdegree 0. This gate is called the output
gate, from which the result of the computation of circuit $C$ on input $x$ can be read.

From such a circuit $C$, an MDP $M$ can be constructed as follows. Because the basic
idea of the construction is very similar to one shown in (Papadimitriou & Tsitsiklis, 1986),
we leave out technical details. As an initial simplifying assumption, assume that the circuit
has no NOT gates. Each gate of the circuit becomes a state of the MDP. The start state
is the output gate. Reverse all edges of the circuit. Hence, a transition in $M$ leads from a
gate in $C$ to one of its predecessors. A transition from an OR gate depends on the action
and is deterministic. On action 0 its left predecessor is reached, and on action 1 its right
predecessor is reached. A transition from an AND gate is probabilistic and does not depend
on the action. With probability $1/2$ the left predecessor is reached, and with probability $1/2$
the right predecessor is reached.

Continue considering a circuit without NOT gates. If an input gate with value 1 is
reached, a large positive reward is gained, and if an input gate with value 0 is reached, no
reward is gained, which makes the total expected reward noticeably smaller than otherwise.
If $C(x) = 1$, then the actions can be chosen at the OR gates so that every trajectory reaches
an input gate with value 1; if this condition holds, then it must be that $C(x) = 1$. Hence,
the MDP has a large positive value if and only if $C(x) = 1$.
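
For circuits without NOT gates, the optimal value of the resulting MDP can be computed by a simple recursion: the controller maximizes at OR gates and the process averages at AND gates. The Python sketch below is illustrative; the dictionary encoding of circuits and the names `mdp_value` and `circuit_output` are assumptions made here.

```python
from fractions import Fraction

def mdp_value(gates, gate, reward):
    """Optimal expected reward when the process starts at `gate`.

    gates: dict name -> ("AND", left, right) | ("OR", left, right) | ("IN", bit)
    """
    kind = gates[gate][0]
    if kind == "IN":
        return Fraction(reward * gates[gate][1])
    left = mdp_value(gates, gates[gate][1], reward)
    right = mdp_value(gates, gates[gate][2], reward)
    if kind == "OR":
        return max(left, right)   # action picks the better predecessor
    return (left + right) / 2     # AND: each predecessor with probability 1/2

def circuit_output(gates, gate):
    """Ordinary evaluation of the monotone circuit, for comparison."""
    kind = gates[gate][0]
    if kind == "IN":
        return gates[gate][1]
    a = circuit_output(gates, gates[gate][1])
    b = circuit_output(gates, gates[gate][2])
    return a | b if kind == "OR" else a & b

true_circuit = {
    "x0": ("IN", 0), "x1": ("IN", 1),
    "a": ("AND", "x0", "x1"), "b": ("AND", "x1", "x1"),
    "out": ("OR", "a", "b"),
}
false_circuit = {"x0": ("IN", 0), "x1": ("IN", 1), "out": ("AND", "x0", "x1")}
```

The value equals the full reward exactly when the circuit outputs 1; otherwise some trajectory ends in a 0 input and the average drops.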

If the circuit has NOT gates, we need to remember the parity of the number of NOT
gates on each trajectory. If the parity is even, everything goes as described above. If the
parity is odd, then the role of AND and OR gates is switched, and the role of 0 and 1 gates
is switched. If a NOT gate is reached, the parity bit is flipped. For every gate in the circuit,
we now take two MDP states: one for even and one for odd parity. Hence, if $G$ is the set of
gates in $C$, the MDP has states $G \times \{0, 1\}$. The state transition function is

$$
t((s,p), a, (s',p')) =
\begin{cases}
1, & \text{if ($s$ is an OR gate and $p = 0$) or ($s$ is an AND gate and $p = 1$), $p = p'$, and $s'$ is predecessor $a$ of $s$;} \\
\frac{1}{2}, & \text{if ($s$ is an OR gate and $p = 1$) or ($s$ is an AND gate and $p = 0$), $p = p'$, and $s'$ is predecessor 0 of $s$;} \\
\frac{1}{2}, & \text{if ($s$ is an OR gate and $p = 1$) or ($s$ is an AND gate and $p = 0$), $p = p'$, and $s'$ is predecessor 1 of $s$;} \\
1, & \text{if $s$ is a NOT gate and $s'$ is a predecessor of $s$ and $p' = 1 - p$;} \\
1, & \text{if $s$ is an input gate or the sink state, and $s'$ is the sink state.}
\end{cases}
$$

Now we have to specify the reward function. If an input gate with value 1 is encountered
on a trajectory where the parity of NOT gates is even, then reward $2^{|C|+k+1}$ is obtained,
where $|C|$ is the size of circuit $C$. The same reward is obtained if an input gate with value
0 is encountered on a trajectory where the parity of NOT gates is odd. All other trajectories
obtain reward 0.

Thus each trajectory receives reward either 0 or $2^{|C|+k+1}$. There are at most $2^{|C|}$
trajectories for each policy. If a policy chooses the correct values for all the gates in order to
prove that $C(x) = 1$, in other words if $C(x) = 1$, then the expected value of an optimal
policy is $2^{|C|+k+1}$. Otherwise, the expected value is at least $2^{|C|+k+1}/2^{|C|} > 2k$ lower than
$2^{|C|+k+1}$, i.e., at most $2^{|C|+k+1} - 2k$.

Thus, if an approximation algorithm is within an additive constant $k$ of the optimal
policy, it will either give a value $\geq 2^{|C|+k+1} - k$ or $< 2^{|C|+k+1} - k$. By inspection of the output,
one can immediately determine whether $C(x) = 1$. Thus, any $k$-additive approximation for
this problem must take at least polynomial time.
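
The parity construction and the reward gap can be illustrated together. The sketch below tracks the state $(\text{gate}, \text{parity})$, switches the roles of AND and OR under odd parity, and pays $2^{|C|+k+1}$ at an input gate whose bit differs from the parity. The circuit encoding and the name `value` are assumptions made for the example.

```python
from fractions import Fraction

def value(gates, gate, parity, reward):
    """Optimal expected reward from state (gate, parity).

    gates: dict name -> ("AND", l, r) | ("OR", l, r) | ("NOT", g) | ("IN", bit)
    """
    kind = gates[gate][0]
    if kind == "IN":
        # Reward for bit 1 under even parity, or bit 0 under odd parity.
        return Fraction(reward if gates[gate][1] != parity else 0)
    if kind == "NOT":
        return value(gates, gates[gate][1], 1 - parity, reward)
    left = value(gates, gates[gate][1], parity, reward)
    right = value(gates, gates[gate][2], parity, reward)
    controlled = (kind == "OR") == (parity == 0)  # OR/even or AND/odd
    return max(left, right) if controlled else (left + right) / 2

k = 3
nand = {"x0": ("IN", 1), "x1": ("IN", 0),
        "g": ("AND", "x0", "x1"), "out": ("NOT", "g")}   # evaluates to 1
nor = {"x0": ("IN", 1), "x1": ("IN", 0),
       "g": ("OR", "x0", "x1"), "out": ("NOT", "g")}     # evaluates to 0
reward = 2 ** (len(nand) + k + 1)  # |C| = 4 gates in both circuits
```

In the second circuit the value is half the reward, far more than $2k$ below the maximum, so a $k$-additive approximation reveals the circuit's output.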

In Figure 5, an example circuit and the MDP to which it is transformed are given. Every 
gate of the circuit is transformed to two states of the MDP: one copy for even parity of 
NOT gates passed on that trajectory (indicated by a thin outline of the state) and one copy 
for odd parity of NOT gates passed on that trajectory (indicated by a thick outline of the 
state). A solid arrow indicates the outcome of action "choose the left predecessor", and a
dashed arrow indicates the outcome of action "choose the right predecessor". Dotted arrows
indicate a transition with probability $1/2$ on any action. The circuit in Figure 5 has value
1. The policy that chooses the right predecessor in the starting state yields trajectories
which all end in an input gate with value 1, and it therefore obtains the optimal value.

□ 

There have been several recent approximation algorithms introduced for structured 
MDPs, many of which are surveyed in (Boutilier et al., 1999). More recent work includes 
a variant of policy iteration by Koller and Parr (2000) and heuristic search in the space of 
finite controllers by Hansen and Feng (2000) and Kim et al. (2000). While these algorithms 
are often highly effective in reducing the asymptotic complexity and actual run times of 
policy construction, they all run in time exponential in the size of the structured
representation, or offer only weak performance guarantees. We show that exponential asymptotic
complexity is necessary for any algorithm scheme that produces $\varepsilon$-approximations for all
$\varepsilon$. For this, we consider MDPs represented by 2TBNs (Boutilier, Dearden, & Goldszmidt,
1995). Until now, we have described the state transition function for MDPs by a function
$t(s, a, s')$ that computes the probability of reaching state $s'$ from state $s$ under action $a$.
We assumed that the transition function was represented explicitly. A two-phase temporal
Bayes net (2TBN) is a succinct representation of an MDP or POMDP. Each state of the
system is described by a vector of values called fluents. (Note that if each of $n$ fluents is
two-valued, then the system has $2^n$ states.) Actions are described by the effect they have
on each fluent by means of two data structures: a dependency graph and a set
of functions encoded as conditional probability tables, decision trees, arithmetic decision
diagrams, or in some other data structure.

Figure 5: A circuit, the MDP it is reduced to, and the trajectories according to an optimal
policy for the MDP

The dependency graph is a directed acyclic graph with nodes partitioned into two sets
$\{v_1, \ldots, v_n\}$ and $\{v'_1, \ldots, v'_n\}$. The first set of nodes represents the state at time $t$, the
second at time $t + 1$. The edges are from the first set of nodes to the second (asynchronous)
or within the second set (synchronous). The value of the $k$-th fluent at time $t+1$ under action
$a$ depends probabilistically on the values of the predecessors of $v'_k$ in this graph. (Note that
the synchronous edges must form a directed, acyclic graph in order for the dependencies to
be evaluated.) The probabilities are spelled out, for each action, in the corresponding data
structure for $v'_k$ and $a$. We will indicate that (stochastic) function by $f_k$.
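
A minimal Python sketch of how a next state could be sampled from such a representation is given below. It is not an implementation from the paper: the data layout (`order`, `parents`, `cpt`) and the function name are assumptions, and the conditional probability tables are modeled as plain functions returning the probability that a fluent is 1.

```python
import random

def sample_next_state(state, action, order, parents, cpt, rng):
    """Sample the fluents at time t+1.

    state:   dict fluent -> 0/1 at time t.
    order:   fluents in a topological order of the synchronous edges.
    parents: dict fluent -> list of (name, primed); primed=True refers to
             the parent's value at time t+1 (synchronous edge).
    cpt:     dict (fluent, action) -> function from a tuple of parent
             values to the probability that the fluent is 1 at time t+1.
    """
    nxt = {}
    for f in order:
        values = tuple(
            nxt[name] if primed else state[name] for name, primed in parents[f]
        )
        nxt[f] = 1 if rng.random() < cpt[(f, action)](values) else 0
    return nxt

# Toy network: a' = NOT a (asynchronous edge), b' = a' (synchronous edge).
order = ["a", "b"]
parents = {"a": [("a", False)], "b": [("a", True)]}
cpt = {
    ("a", "go"): lambda v: 1.0 - v[0],
    ("b", "go"): lambda v: float(v[0]),
}
result = sample_next_state({"a": 0, "b": 0}, "go", order, parents, cpt,
                           random.Random(0))
```

The representation stays small even though $n$ binary fluents encode $2^n$ states, because each table only mentions a fluent's parents.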

We make no assumptions about the structure of rewards for 2TBNs. In fact, the final 
2TBN constructed in the proof of Theorem 5.2 has very large rewards which are computed 
implicitly; in time polynomial in the size of the 2TBN, one can explicitly compute any 
individual bit of the reward. This has the effect of making the potential value of the 2TBN 
too large to write down with polynomially many bits. 

Theorem 5.2 The problem of k-additive approximating any optimal stationary policy for 
an MDP in 2TBN-representation is EXP-hard. 

Proof The general strategy is similar to the proof of Theorem 5.1. We give a reduction 
from the EXP-complete succinct circuit value problem to the problem for MDPs in 2TBN- 
representation. An instance of the succinct circuit value problem is a Boolean circuit $S$ that 
describes a circuit $C$ and an input $x$, i.e., $S$ describes an instance of the "flat" circuit value 
problem. We can assume that in $C$, each gate is a predecessor to at most two other gates.$^4$ 
Then every gate in $C$ has four neighbors, two of which supply its inputs, and two of 
which receive its output as input (if there are fewer neighbors, the missing neighbors are 
set to a fictitious gate 0). Consider a gate $i$ of $C$. Say that the outputs of neighbors 0 and 1 are 
the inputs to gate $i$, and the output of gate $i$ is input to neighbors 2 and 3. Now, the circuit 
$S$ on input $(i,k)$ outputs $(j,s)$, where gate $j$ is the $k$th neighbor of gate $i$, and $s$ encodes 
the type of gate $j$ (AND, OR, NOT, 0, and 1). The idea is to construct from $C$ an MDP 
$M$ as in the proof of Theorem 5.1. However, we do it succinctly. Hence, we construct from 
$S$ a 2TBN-representation of an MDP $M(S)$. The actions of $M(S)$ are 0 and 1, for choosing 
neighbor 0 of the current state-gate or, respectively, neighbor 1. The states of $M(S)$ are 
tuples $(i,p,t,r)$ where $i$ is a gate of $C$, $p$ is the parity bit (as in the proof of Theorem 5.1), 
$t$ is the type of gate $i$, and $r$ is used for a random bit. Every gate number $i$ is given in 
binary using, say, $l$ bits. Then the 2TBN has $l+3$ fluents $i_1, i_2, \ldots, i_l, p, t, r$. Let 
$f_1, f_2, \ldots, f_l, f_p, f_t, f_r$ be the stochastic functions that calculate $i'_1, i'_2, \ldots, i'_l, p', t', r'$. The 
simplest is $f_r$ for the fluent $r'$, which is used as a random bit if, from state $i = i_1 \cdots i_l$, the next 
state is chosen uniformly at random from the two predecessors of gate $i$ in $C$. This 
happens if the type $t$ of gate $i$ is AND and the parity $p$ is 0, or if $t$ is OR and $p$ is 1. In 
these cases, $r'$ determines its value 0 or 1 by flipping a coin. Otherwise, $r'$ equals 1. Notice 
that $r'$ is independent of the action. 

The functions $f_c$ for the fluents $i'_c$ determine the bits of the next state. If $t$ is an AND 
and the parity $p$ is even, then a predecessor of gate $i$ is chosen "randomly". "Randomly" 
means here that the random bit $r'$ determines whether predecessor 0 or predecessor 1 is 
chosen. Hence, $i'_c$ is the $c$th bit of $j$, where $(j,s)$ is the output of $S$ on input $(i,r')$. 
Accordingly, $t' = s$ is the type of the chosen gate, and $p' = p$ remains unchanged. The same 
happens if $t$ is an OR and the parity $p$ is odd. If $t$ is a NOT, there is only one predecessor 
of $i$, and that one must be chosen for $i'$ and $t'$. The parity bit $p'$ is flipped to $1 - p$. If $t$ is an 
OR and the parity $p$ is even, then on action $a \in \{0,1\}$, the predecessor $a$ of gate $i$ is chosen. 
Hence, $i'_c$ is the $c$th bit of $j$, where $(j,s)$ is the output of $S$ on input $(i,a)$. Accordingly, 
$t' = s$ and $p' = p$. The same happens if $t$ is an AND and the parity $p$ is odd. Hence, the 
function $f_c$ can be calculated as follows. 

    input i, p, t, r', a 
    if (t = OR and p = 0) or (t = AND and p = 1) 
        then calculate S(i,a) = (j,s) 
    else if (t = OR and p = 1) or (t = AND and p = 0) 
        then calculate S(i,r') = (j,s) 
    else if t = NOT 
        then calculate S(i,0) = (j,s) 
    else j = 0 
    output the c-th bit of j 

The state 0 is a sink state which is reached from the input gates within one step and 
which is never left. The type $t'$ of the next state or gate is calculated accordingly. 
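
The case distinction for $f_c$ can be made concrete with a short sketch. It is only an illustration under stated assumptions: the succinct circuit is modelled as a hypothetical Python callable `S(i, k)` returning a pair `(j, s)`, gate types are the strings "AND", "OR", "NOT", "0", "1", and bit 1 is taken to be the least significant bit of the gate number (the text does not fix a bit order).

```python
def f_c(c, i, p, t, r_next, a, S):
    """Return bit c of the next gate number (bit 1 = least significant, assumed).

    i, p, t : current gate number, parity bit and gate type
    r_next  : the freshly sampled random bit r'
    a       : the action, 0 or 1
    S       : hypothetical callable, S(i, k) -> (j, s), the k-th neighbor of
              gate i and a type code
    """
    if (t == "OR" and p == 0) or (t == "AND" and p == 1):
        j, _ = S(i, a)          # the action picks the predecessor
    elif (t == "OR" and p == 1) or (t == "AND" and p == 0):
        j, _ = S(i, r_next)     # the random bit r' picks the predecessor
    elif t == "NOT":
        j, _ = S(i, 0)          # a NOT gate has a single predecessor
    else:
        j = 0                   # input gates lead to the sink state 0
    return (j >> (c - 1)) & 1
```

Calling `f_c` once for each `c` from 1 to `l` yields all bits of the next gate number; the same branch of the case distinction is taken for every bit.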

4. If this is not the case, and $d$ is the maximum out-degree of a gate, we can replace the circuit by one with 
maximum out-degree 2 and size at most $\log d$ larger. Since $d < |C|$, such a substitution will not affect 
the asymptotic complexity of any of our algorithms. 
One can also simulate the circuit $S$ for function $f_c$ in the above algorithm by a 2TBN. 
Note that, in general, circuits can have more than one output. We consider this more 
general model here. 

Claim 1 Every Boolean circuit can be simulated by a 2TBN, to which it can be transformed 
in polynomial time. 

Proof We sketch the construction idea. Let $R$ be a circuit with $n$ input gates and $n'$ 
output gates. The outcome of the circuit on any input $b_1, \ldots, b_n$ is usually calculated as 
follows. First, calculate the outcome of all gates that get input only from input gates. 
Next, calculate the outcome of all gates that get their inputs only from those gates whose 
outcome is already calculated, and so on. This yields an enumeration of the gates of a 
circuit in topological order, i.e., such that the outcome of a gate can be calculated when 
all the outcomes of gates with a smaller index are already calculated. We assume that the 
gates are enumerated in this way, that $g_1, \ldots, g_n$ are the input gates, and that $g_l, \ldots, g_s$ 
are the other gates, where the smallest index of a gate which is neither an output nor an 
input gate equals $l = \max(n, n') + 1$. 

Now, we define a 2TBN $T$ simulating $R$ as follows. $T$ has a fluent for every gate of 
$R$, say fluents $v_1, \ldots, v_s$. The basic idea is that fluents $v_1, \ldots, v_n$ represent the input gates 
of $R$. In one time step, values are propagated from the input nodes $v_1, \ldots, v_n$ through all 
gate nodes $v'_l, \ldots, v'_s$, and the outputs are copied to $v'_1, \ldots, v'_{n'}$. The dependency graph has the 
following edges according to the "wires" of the circuit $R$. If an input gate $g_i$ ($1 \le i \le n$) 
outputs an input to gate $g_j$, then we get an edge from $v_i$ to $v'_j$. If the output of a non-input 
gate $g_i$ ($n < i \le s$) is input to gate $g_j$, then we get an edge from $v'_i$ to $v'_j$. Finally, the nodes 
$v'_1, \ldots, v'_{n'}$ stand for the value bits. If gate $g_j$ produces the $i$th output bit, then there is an 
edge from $v'_j$ to $v'_i$. Because the circuit $R$ has no loop, the graph is loop-free, too. 

The functions associated with the nodes $v'_1, \ldots, v'_s$ depend on the functions calculated by 
the respective gates and are as follows. Each of the value nodes $v'_i$ for $i = 1, 2, \ldots, n'$, which 
stand for the value bits, has exactly one predecessor, whose value is copied into $v'_i$. Hence, 
$f_i$ is the one-place identity function, $f_i(x) = x$ with probability 1, for $i = 1, 2, \ldots, n'$. Now 
we consider the nodes which come from internal gates of the circuit. If $g_i$ is an AND gate, 
then $f_i(x,y) = x \wedge y$, where $x$ and $y$ are the predecessors of gate $g_i$. If $g_i$ is an OR gate, 
then $f_i(x,y) = x \vee y$, and if $g_i$ is a NOT gate, then $f_i(x) = \neg x$, all with probability 1. 

By this construction, it follows that the 2TBN $T$ simulates the Boolean circuit $R$. Notice 
that the number of fluents of $T$ is at most twice the number of gates of $R$. The 
transformation from $R$ to $T$ can be performed in polynomial time. □ 
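
The construction in the proof of Claim 1 can be sketched in a few lines. This is a simplified illustration rather than the exact construction: gates are 0-indexed, the input gates are assumed to come first in the (topologically ordered) gate list, and the value nodes are given separate names `("out", k)` instead of reusing the indices $1, \ldots, n'$. All function and variable names are hypothetical.

```python
def circuit_to_2tbn(gates, outputs):
    """Build the time t+1 nodes of a deterministic 2TBN from a circuit.

    gates   : topologically ordered list; gates[i] is ("IN",), ("NOT", a),
              ("AND", a, b) or ("OR", a, b), where a, b index earlier gates
    outputs : indices of the gates producing the output bits
    Returns a list of (node, parents, function) in evaluation order. A parent
    ("v", i) is a time t fluent, ("v'", i) a time t+1 node (synchronous edge).
    """
    ops = {"NOT": lambda x: 1 - x,
           "AND": lambda x, y: x & y,
           "OR": lambda x, y: x | y}

    def node(i):  # input gates are read at time t, all other gates at time t+1
        return ("v", i) if gates[i][0] == "IN" else ("v'", i)

    nodes = []
    for i, g in enumerate(gates):
        if g[0] != "IN":
            nodes.append((("v'", i), [node(a) for a in g[1:]], ops[g[0]]))
    for k, j in enumerate(outputs):  # value nodes copy the output gates
        nodes.append((("out", k), [node(j)], lambda x: x))
    return nodes


def simulate_step(nodes, input_bits):
    """Propagate the input bits through all nodes in one time step."""
    values = {("v", i): b for i, b in enumerate(input_bits)}
    for name, parents, fn in nodes:
        values[name] = fn(*(values[p] for p in parents))
    return [values[name] for name, _, _ in nodes if name[0] == "out"]


# A circuit computing the binary sum (carry, sum) of two input bits.
half_adder = [("IN",), ("IN",), ("AND", 0, 1), ("OR", 0, 1), ("NOT", 2), ("AND", 3, 4)]
print(simulate_step(circuit_to_2tbn(half_adder, [2, 5]), [1, 1]))  # prints [1, 0]
```

Every function in the resulting network is deterministic, matching the "with probability 1" functions in the proof, and the number of nodes is linear in the number of gates.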

An example of a Boolean circuit and the 2TBN to which it is transformed as described 
above is given in Figure 6. 

Now, we can construct from the circuit $S$ that is a succinct representation of a circuit 
$C$ a 2TBN $T_S$ with fluents $i_1, i_2, \ldots, i_l, p, t, r$ as already defined, plus additional fluents for 
the gates of $S$, using the technique from the above Claim. Taking the action $a$, the parity 
$p$, the gate type $t$ and the random bit $r'$ into account, we can construct $T_S$, according to 
the description of function $f_c$ above, so that fluent $v'_k$ contains the bit described by the 
function $f_c(i_1, i_2, \ldots, i_l, p, t, r')$ above. Notice that the function $f_k$ for $v'_k$ is dependent only 
on the predecessors of the gate of $S$ represented by $v'_k$, the fluents $p$, $t$, $r'$, and the action $a$. 

Figure 6: A Boolean circuit which outputs the binary sum of its input bits, and a 2TBN 
representing the circuit. Only the function $f_1$ (the identity function) and the functions 
simulating a NOT gate, an AND gate, and an OR gate are described. 

Hence, it has at most 6 arguments and can be described by a small table. This holds for 
all fluents of $T_S$. Finally, the function $f_c$ for $T_S$ just copies the value of $v'_k$ into $i'_c$. Hence, 
from $S$ we can construct an MDP in 2TBN representation similar to the MDP in the proof 
of Theorem 5.1. Next, we specify the rewards of this MDP. The reward is $2^{2^{|S|+k+1}}$ if any 
action is taken on a state representing an input gate with value 1 and parity 0, or with value 
0 and parity 1. Otherwise, the reward equals 0. This reward function can be represented 
by a circuit, which on binary input $i, a, b$ outputs the $b$th bit of the reward obtained in state 
$i$ on action $a$. (Since it requires $2^{|S|+k+1}$ bits to represent the reward, $b$ can be represented 
using only $|S| + k + 1$ bits.) 

If $C(x) = 1$, then there is a choice of actions for each state that gives reward $2^{2^{|S|+k+1}}$ on 
every trajectory, similar to the proof of Theorem 5.1. However, if $C(x) = 0$, any policy has 
at least one trajectory that receives reward 0. Now, there are at most $2^{2^{|S|}}$ trajectories, 
and therefore there is a gap of at least $2^{2^{|S|+k+1}} / 2^{2^{|S|}} > 2k$ between possible values. As 
above, we conclude that any $k$-additive approximation to the factored MDP problem gives 
a decision algorithm for the succinct circuit value problem. Therefore, the lower bound of 
EXP-hardness for the factored MDP value problem holds for this approximation problem 
as well. □ 

The following structured representation is more general than the representations more 
common to the AI/planning community. We say that an MDP has a succinct representation, 
or is a succinct MDP, if there are Boolean circuits $C_t$ and $C_r$ such that $C_t(s, a, s', i)$ produces 
the $i$th bit of the transition probability $t(s, a, s')$ and $C_r(s, a, i)$ produces the $i$th bit of the 
reward $r(s, a)$. Similar to the proof of Theorem 5.2, we can also prove nonapproximability 
of MDP values for succinctly represented MDPs. 

Theorem 5.3 The problem of k-additive approximating the optimal stationary policy for a 
succinctly represented MDP is EXP-hard. 

6. Non-Approximability for Infinite-Horizon POMDPs 

The discounted value of an infinite-horizon POMDP is the maximum total discounted per- 
formance. When we discuss the policy existence problem or the average case performance 
in the infinite horizon, it is necessary to specify the reward criterion. We generalize the 
value function as follows. 

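The $\alpha,\beta$-value $\mathrm{val}_{\alpha,\beta}(M)$ of $M$ is $M$'s maximal $\beta$-performance under any policy $\pi$ of 
type $\alpha$, i.e., $\mathrm{val}_{\alpha,\beta}(M) = \max_{\pi \in \Pi_\alpha} \mathrm{perf}_\beta(M, \pi)$. 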

Note that a time-dependent or history-dependent infinite-horizon policy for a POMDP is 
not necessarily finitely representable. For fully-observable MDPs, it turned out (see e.g. 
Puterman, 1994) that the discounted or average value is the performance of a stationary policy. 
This means that no history-dependent policy performs better than the best stationary one. 
As an important consequence, an optimal policy is finitely representable. For POMDPs, 
this does not hold. Madani et al. (1999) showed that the time-dependent infinite-horizon 
policy-existence problem for POMDPs is not decidable under average performance or under 
total discounted performance. In contrast, we show that the same problem for stationary 
policies is NP-complete. 
Theorem 6.1 The stationary infinite-horizon policy-existence problem for POMDPs under 
total discounted or average performance is NP-complete. 

The hardness proof is essentially the same as for Theorem 4.1. Note that in that 
construction, every stationary policy obtains reward 1 for at most one step, namely when 
sink state T is reached, meaning that the formula is satisfied. All other steps yield reward 
0. Therefore, for this construction, the total discounted value is greater than 0 if and only if 
the finite-horizon value is so. To make the construction work for average value, we have to 
modify it such that once the sink state T is reached, every subsequent action brings reward 
1. Therefore, the average value equals 1 if the formula is satisfiable, and it equals 0 if it is 
unsatisfiable. Hence, both problems are NP-hard. 

Containment in NP for the total discounted performance follows from the guess-and- 
check approach: Guess a stationary policy, calculate its performance and accept if and only 
if the performance is positive. The total discounted and the average performance can both 
be calculated in polynomial time. 
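
The "check" step for total discounted performance can be illustrated as follows. This is a minimal sketch under assumptions not fixed in the text: the POMDP is given by hypothetical dictionaries `t`, `o`, `r`, the discount factor `beta` satisfies $0 \le \beta < 1$, and the performance is obtained by solving the linear system $(I - \beta P_\pi) v = r_\pi$ for the Markov chain induced by the stationary policy, using exact rational arithmetic. Gaussian elimination takes polynomially many arithmetic operations.

```python
from fractions import Fraction


def discounted_performance(states, s0, t, o, r, policy, beta):
    """Exact total discounted performance of a stationary policy (sketch).

    t[s][a] : dict s' -> transition probability
    o[s]    : observation made in state s
    r[s][a] : reward
    policy  : dict observation -> action
    beta    : discount factor with 0 <= beta < 1
    """
    n = len(states)
    idx = {s: k for k, s in enumerate(states)}
    beta = Fraction(beta)
    # Augmented matrix [I - beta * P | r_pi] of the chain induced by the policy.
    m = [[Fraction(int(i == j)) for j in range(n)] + [Fraction(0)] for i in range(n)]
    for s in states:
        a = policy[o[s]]
        for s2, prob in t[s][a].items():
            m[idx[s]][idx[s2]] -= beta * Fraction(prob)
        m[idx[s]][n] = Fraction(r[s][a])
    # Gauss-Jordan elimination; I - beta * P is nonsingular for beta < 1.
    for col in range(n):
        pivot = next(row for row in range(col, n) if m[row][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for row in range(n):
            if row != col and m[row][col] != 0:
                m[row] = [x - m[row][col] * y for x, y in zip(m[row], m[col])]
    return m[idx[s0]][n]
```

A nondeterministic algorithm would guess `policy`, call this function, and accept if and only if the returned value is positive.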

In the same way, the techniques proving nonapproximability results for the stationary 
policy in the finite horizon case (Corollary 4.2) can be modified to obtain nonapproxima- 
bility results for infinite horizons. 

Theorem 6.2 The stationary infinite-horizon value of POMDPs under total discounted or 
average performance can be $\varepsilon$-approximated if and only if P = NP. 

The infinite-horizon time-dependent policy-existence problems are undecidable (Madani 
et al., 1999). We show that no computable function can even approximate optimal policies. 

Theorem 6.3 The time-dependent infinite-horizon value of unobservable POMDPs under 
average performance cannot be $\varepsilon$-approximated. 

The proof follows from the proof by Madani et al. (1999) showing the uncomputability 
of the time-dependent value. In Madani et al. (1999), from a given Turing machine $T$ an 
unobservable POMDP is constructed having the following properties for arbitrary $\delta > 0$: 
(1) if $T$ halts on empty input, then there is exactly one time-dependent infinite-horizon 
policy with performance $> 1 - \delta$, (2) all other time-dependent policies have performance 
$< \delta$, and (3) the average value is between 0 and 1. This reduces the undecidable problem 
of whether a Turing machine halts on empty input to the time-dependent infinite-horizon 
policy existence problem for unobservable POMDPs under average performance. Actually, 
assuming that the value of the unobservable POMDP were $\varepsilon$-approximable, we could choose 
$\delta$ in a way that even the approximation enables us to decide whether $T$ halts on empty input. 
Since this is undecidable, an $\varepsilon$-approximation is impossible. 

Corollary 6.4 The time-dependent and history-dependent infinite-horizon value of 
POMDPs under average performance cannot be $\varepsilon$-approximated. 
Appendix A. Proof of Theorem 4.1 



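We present the reduction from (Mundhenk et al., 2000). Let $\phi(x_1, \ldots, x_n)$ be an instance of 
3Sat with variables $x_1, \ldots, x_n$ and clauses $C_1, \ldots, C_m$, where clause $C_j = (l_{v(1,j)} \vee l_{v(2,j)} \vee 
l_{v(3,j)})$ for $l_i \in \{x_i, \neg x_i\}$. We say that variable $x_i$ appears in $C_j$ with signum 0 (resp. 1) if 
$\neg x_i$ (resp. $x_i$) is a literal in $C_j$. 

From $\phi$, we construct a POMDP $M(\phi) = (S, s_0, A, O, t, o, r)$ with 

$$S = \{(i,j) \mid 1 \le i \le n,\ 1 \le j \le m\} \cup \{F, T\}, \quad s_0 = (v(1,1), 1), \quad A = \{0, 1\}, \quad O = \{x_1, \ldots, x_n, F, T\},$$

$$t(s,a,s') = \begin{cases}
1, & \text{if } s = (v(i,j),j),\ s' = (v(1,j+1),j+1),\ j < m,\ 1 \le i \le 3, \\
   & \text{and } x_{v(i,j)} \text{ appears in } C_j \text{ with signum } a \\
1, & \text{if } s = (v(i,m),m),\ s' = T,\ 1 \le i \le 3, \\
   & \text{and } x_{v(i,m)} \text{ appears in } C_m \text{ with signum } a \\
1, & \text{if } s = (v(i,j),j),\ s' = (v(i+1,j),j),\ 1 \le i < 3, \\
   & \text{and } x_{v(i,j)} \text{ appears in } C_j \text{ with signum } 1-a \\
1, & \text{if } s = (v(3,j),j),\ s' = F, \\
   & \text{and } x_{v(3,j)} \text{ appears in } C_j \text{ with signum } 1-a \\
1, & \text{if } s = s' = F \text{ or } s = s' = T \\
0, & \text{otherwise}
\end{cases}$$

$$r(s,a) = \begin{cases}
1, & \text{if } t(s,a,T) = 1 \text{ and } s \ne T \\
0, & \text{otherwise}
\end{cases}
\qquad
o(s) = \begin{cases}
x_i, & \text{if } s = (i,j) \\
T, & \text{if } s = T \\
F, & \text{if } s = F.
\end{cases}$$

Note that all transitions in $M(\phi)$ are deterministic, and every trajectory has value 0 or 
1. There is a correspondence between policies for $M(\phi)$ and assignments of values to the 
variables of $\phi$, such that policies under which $M(\phi)$ has value 1 correspond to satisfying 
assignments for $\phi$, and vice versa. 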
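
The correspondence between stationary policies and assignments can be checked on small formulas with the following sketch. It does not build the POMDP explicitly; instead, under the assumption that a stationary policy maps each observation $x_i$ to an action in $\{0, 1\}$, it follows the single deterministic trajectory described by the transition function. The clause encoding and function names are hypothetical.

```python
from itertools import product


def policy_value(clauses, policy):
    """Follow the unique trajectory of M(phi) under a stationary policy.

    clauses : list of clauses, each a list of three (variable, signum) pairs,
              with signum 1 for x_i and 0 for its negation
    policy  : dict variable -> action in {0, 1}
    Returns 1 if the trajectory reaches T and 0 if it reaches F.
    """
    for clause in clauses:                 # enter clause j at its first literal
        for var, signum in clause:         # try literals 1, 2, 3 in turn
            if policy[var] == signum:      # literal satisfied: go to next clause
                break
        else:
            return 0                       # third literal failed: sink state F
    return 1                               # last clause satisfied: sink state T


def max_stationary_value(clauses, n):
    """Brute-force maximum over all 2^n stationary policies (exponential time)."""
    return max(policy_value(clauses, dict(zip(range(1, n + 1), bits)))
               for bits in product((0, 1), repeat=n))
```

The brute-force maximum equals 1 exactly when the formula is satisfiable. The exhaustive search is exponential and serves only to illustrate the correspondence, not as a practical procedure.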



Appendix B. Proof of Theorem 4.6 

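Again, we present the reduction from (Mundhenk et al., 2000). Let $\phi$ be a formula with $n$ 
variables $x_1, \ldots, x_n$ and $m$ clauses $C_1, \ldots, C_m$. This time, we define the unobservable MDP 
$M(\phi) = (S, s_0, A, t, r)$ where 

$$S = \{(i,j) \mid 1 \le i \le n,\ 1 \le j \le m\} \cup \{sat_i \mid 1 \le i \le n\} \cup \{s_0, T, F\}, \quad A = \{0, 1\},$$

$$t(s,a,s') = \begin{cases}
\frac{1}{m}, & \text{if } s = s_0,\ s' = (1,j),\ 1 \le j \le m \\
1, & \text{if } s = (i,j),\ s' = sat_{i+1},\ i < n,\ x_i \text{ appears in } C_j \text{ with signum } a \\
1, & \text{if } s = (i,j),\ s' = (i+1,j),\ i < n, \\
   & x_i \text{ does not appear in } C_j \text{ with signum } a \\
1, & \text{if } s = (n,j),\ s' = T,\ x_n \text{ appears in } C_j \text{ with signum } a \\
1, & \text{if } s = (n,j),\ s' = F,\ x_n \text{ does not appear in } C_j \text{ with signum } a \\
1, & \text{if } s = sat_i,\ s' = sat_{i+1},\ i < n \\
1, & \text{if } s = sat_n,\ s' = T \\
1, & \text{if } s = s' = F \text{ or } s = s' = T,\ a = 0 \text{ or } a = 1 \\
0, & \text{otherwise}
\end{cases}$$

$$r(s,a) = \begin{cases}
1, & \text{if } s \ne T \text{ and } t(s,a,T) > 0 \\
0, & \text{otherwise.}
\end{cases}$$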